The rapid development of 3D acquisition technology has made it possible to obtain point clouds of real-world terrains. However, due to limitations in sensor acquisition technology or specific requirements, point clouds often contain defects such as holes with missing data. Inpainting algorithms are widely used to patch these holes. However, existing traditional inpainting algorithms rely on precise hole boundaries, which limits their ability to handle cases where the boundaries are not well-defined. On the other hand, learning-based completion methods often prioritize reconstructing the entire point cloud instead of solely focusing on hole filling. Based on the fact that real-world terrain exhibits both global smoothness and rich local detail, we propose a novel representation for terrain point clouds that helps repair holes lacking clear boundaries. Specifically, it decomposes terrains into low-frequency and high-frequency components, represented by B-spline surfaces and relative height maps, respectively. In this way, the terrain point cloud inpainting problem is transformed into a B-spline surface fitting problem and a 2D image inpainting problem. By solving these two problems, highly complex and irregular holes in terrain point clouds can be filled so that the result not only follows the global terrain undulation but also exhibits rich geometric details. Experimental results demonstrate the effectiveness of our method.
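A minimal sketch of the decomposition idea, assuming the terrain is given as scattered (x, y, z) points with holes: fit a smooth B-spline surface as the low-frequency component (SciPy's SmoothBivariateSpline stands in for the paper's fitting step), rasterize the residuals into a relative height map, and fill the empty cells with an off-the-shelf 2D inpainting routine (OpenCV here). The functions, grid resolution, and smoothing value are illustrative, not the authors' implementation.

```python
# Sketch: low-/high-frequency terrain decomposition for hole filling.
# Assumptions: SciPy's SmoothBivariateSpline and cv2.inpaint stand in for the
# paper's B-spline fitting and 2D image inpainting steps.
import numpy as np
import cv2
from scipy.interpolate import SmoothBivariateSpline

def fill_terrain_holes(x, y, z, grid_res=256, smooth=1e3):
    # 1) Low-frequency component: smooth B-spline surface fit to the known points.
    spline = SmoothBivariateSpline(x, y, z, s=smooth)

    # 2) Rasterize the residual (high-frequency) heights onto a regular grid.
    gx = np.linspace(x.min(), x.max(), grid_res)
    gy = np.linspace(y.min(), y.max(), grid_res)
    ix = np.clip(np.searchsorted(gx, x), 0, grid_res - 1)
    iy = np.clip(np.searchsorted(gy, y), 0, grid_res - 1)
    residual = np.zeros((grid_res, grid_res), np.float32)
    known = np.zeros((grid_res, grid_res), np.uint8)
    residual[iy, ix] = z - spline.ev(x, y)     # relative height map
    known[iy, ix] = 1

    # 3) Inpaint the empty cells (the holes) of the relative height map.
    lo, hi = residual[known > 0].min(), residual[known > 0].max()
    img8 = np.zeros_like(residual, np.uint8)
    img8[known > 0] = np.uint8(255 * (residual[known > 0] - lo) / (hi - lo + 1e-8))
    filled8 = cv2.inpaint(img8, 1 - known, 3, cv2.INPAINT_TELEA)
    filled = lo + filled8.astype(np.float32) / 255.0 * (hi - lo)

    # 4) Recompose: low-frequency surface + inpainted high-frequency detail.
    GX, GY = np.meshgrid(gx, gy)
    return spline.ev(GX.ravel(), GY.ravel()).reshape(grid_res, grid_res) + filled
```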
https://arxiv.org/abs/2404.03572
We present GenN2N, a unified NeRF-to-NeRF translation framework for various NeRF translation tasks such as text-driven NeRF editing, colorization, super-resolution, inpainting, etc. Unlike previous methods designed for individual translation tasks with task-specific schemes, GenN2N achieves all these NeRF editing tasks by employing a plug-and-play image-to-image translator to perform editing in the 2D domain and lifting the 2D edits into the 3D NeRF space. Since the 3D consistency of 2D edits may not be assured, we propose to model the distribution of the underlying 3D edits through a generative model that can cover all possible edited NeRFs. To model the distribution of 3D edited NeRFs from 2D edited images, we carefully design a VAE-GAN that encodes images while decoding NeRFs. The latent space is trained to align with a Gaussian distribution, and the NeRFs are supervised through an adversarial loss on their renderings. To ensure the latent code does not depend on 2D viewpoints but truly reflects the 3D edits, we also regularize the latent code through a contrastive learning scheme. Extensive experiments on various editing tasks show that GenN2N, as a universal framework, performs as well as or better than task-specific specialists while possessing flexible generative power. More results on our project page: this https URL
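The contrastive regularization of the latent code can be illustrated with a small InfoNCE-style loss that pulls together codes obtained from different 2D viewpoints of the same edit and pushes apart codes from different edits. This is a hedged sketch of the idea under that assumption, not GenN2N's actual loss or architecture; the `latents` layout is hypothetical.

```python
# Sketch: viewpoint-invariant latent regularization via an InfoNCE-style loss.
# Assumption: `latents` has shape (num_edits, num_views, dim), where codes in one
# row come from 2D edits of the same underlying 3D edit.
import torch
import torch.nn.functional as F

def contrastive_latent_loss(latents: torch.Tensor, temperature: float = 0.1):
    e, v, d = latents.shape
    z = F.normalize(latents.reshape(e * v, d), dim=-1)
    sim = z @ z.t() / temperature                        # cosine similarities
    edit_id = torch.arange(e, device=latents.device).repeat_interleave(v)
    pos = edit_id.unsqueeze(0) == edit_id.unsqueeze(1)   # positives: same edit, any view
    pos.fill_diagonal_(False)                            # exclude self-pairs
    self_mask = torch.eye(e * v, dtype=torch.bool, device=latents.device)
    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(log_prob[pos]).mean()                       # maximize probability of positive pairs
```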
https://arxiv.org/abs/2404.02788
Our paper addresses the complex task of transferring a hairstyle from a reference image to an input photo for virtual hair try-on. This task is challenging due to the need to adapt to various photo poses, the sensitivity of hairstyles, and the lack of objective metrics. Current state-of-the-art hairstyle transfer methods use an optimization process for different parts of the approach, making them inexcusably slow. At the same time, faster encoder-based models are of very low quality because they either operate in StyleGAN's W+ space or use other low-dimensional image generators. Additionally, both approaches struggle with hairstyle transfer when the source pose differs greatly from the target pose, because they either do not consider the pose at all or handle it inefficiently. In our paper, we present the HairFast model, which uniquely solves these problems and achieves high resolution, near real-time performance, and superior reconstruction compared to optimization-based methods. Our solution includes a new architecture operating in the FS latent space of StyleGAN, an enhanced inpainting approach, improved encoders for better alignment and color transfer, and a new encoder for post-processing. The effectiveness of our approach is demonstrated on realism metrics after random hairstyle transfer and on reconstruction when the original hairstyle is transferred back. In the most difficult scenario of transferring both the shape and color of a hairstyle from different images, our method runs in less than a second on an Nvidia V100. Our code is available at this https URL.
https://arxiv.org/abs/2404.01094
Image-based virtual try-on is an increasingly important task for online shopping. It aims to synthesize images of a specific person wearing a specified garment. Diffusion model-based approaches have recently become popular, as they are excellent at image synthesis tasks. However, these approaches usually employ additional image encoders and rely on the cross-attention mechanism for texture transfer from the garment to the person image, which affects the try-on's efficiency and fidelity. To address these issues, we propose a Texture-Preserving Diffusion (TPD) model for virtual try-on, which enhances the fidelity of the results and introduces no additional image encoders. Accordingly, we make contributions from two aspects. First, we propose to concatenate the masked person and reference garment images along the spatial dimension and use the resulting image as the input to the diffusion model's denoising UNet. This enables the original self-attention layers contained in the diffusion model to achieve efficient and accurate texture transfer. Second, we propose a novel diffusion-based method that predicts a precise inpainting mask based on the person and reference garment images, further enhancing the reliability of the try-on results. In addition, we integrate mask prediction and image synthesis into a single compact model. The experimental results show that our approach can be applied to various try-on tasks, e.g., garment-to-person and person-to-person try-on, and significantly outperforms state-of-the-art methods on the popular VITON and VITON-HD databases.
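The key input construction, concatenating the masked person image and the reference garment along a spatial axis so the denoising UNet's own self-attention can transfer texture, can be sketched as below. The `unet` call, `mask` convention, and latent shapes are placeholders, not the paper's exact model.

```python
# Sketch: spatial concatenation of masked person and garment latents, so that
# self-attention inside the denoising UNet sees both in one feature map.
# `unet`, `mask`, and the tensor shapes are illustrative placeholders.
import torch

def tpd_denoise_step(unet, person_latent, garment_latent, mask, t):
    # Zero out the try-on region of the person latent (the area to be synthesized).
    masked_person = person_latent * (1 - mask)

    # Concatenate along the width so person and garment share one canvas.
    x = torch.cat([masked_person, garment_latent], dim=-1)    # (B, C, H, 2W)

    noise_pred = unet(x, t)                                    # plain self-attention UNet

    # Keep only the person half of the prediction for the try-on result.
    return noise_pred[..., : person_latent.shape[-1]]
```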
https://arxiv.org/abs/2404.01089
Omnidirectional cameras are extensively used in various applications to provide a wide field of vision. However, they face a challenge in synthesizing novel views due to the inevitable presence of dynamic objects, including the photographer, in their wide field of view. In this paper, we introduce a new approach called Omnidirectional Local Radiance Fields (OmniLocalRF) that can render static-only scene views, removing and inpainting dynamic objects simultaneously. Our approach combines the principles of local radiance fields with the bidirectional optimization of omnidirectional rays. Our input is an omnidirectional video, and we evaluate the mutual observations of the entire angle between the previous and current frames. To reduce ghosting artifacts of dynamic objects and inpaint occlusions, we devise a multi-resolution motion mask prediction module. Unlike existing methods that primarily separate dynamic components through the temporal domain, our method uses multi-resolution neural feature planes for precise segmentation, which is more suitable for long 360-degree videos. Our experiments validate that OmniLocalRF outperforms existing methods in both qualitative and quantitative metrics, especially in scenarios with complex real-world scenes. In particular, our approach eliminates the need for manual interaction, such as drawing motion masks by hand and additional pose estimation, making it a highly effective and efficient solution.
https://arxiv.org/abs/2404.00676
Transformer-based methods have achieved great success in image inpainting recently. However, we find that these solutions regard each pixel as a token, thus suffering from an information loss issue from two aspects: 1) they downsample the input image to much lower resolutions for efficiency; 2) they quantize $256^3$ RGB values to a small number (such as 512) of quantized color values. The indices of quantized pixels are used as tokens for the inputs and prediction targets of the transformer. To mitigate these issues, we propose a new transformer-based framework called "PUT". Specifically, to avoid input downsampling while maintaining computational efficiency, we design a patch-based auto-encoder, P-VQVAE. The encoder converts the masked image into non-overlapping patch tokens, and the decoder recovers the masked regions from the inpainted tokens while keeping the unmasked regions unchanged. To eliminate the information loss caused by input quantization, an Un-quantized Transformer is applied. It directly takes features from the P-VQVAE encoder as input without any quantization and only regards the quantized tokens as prediction targets. Furthermore, to make the inpainting process more controllable, we introduce semantic and structural conditions as extra guidance. Extensive experiments show that our method greatly outperforms existing transformer-based methods in image fidelity and achieves much higher diversity and better fidelity than state-of-the-art pluralistic inpainting methods on complex large-scale datasets (e.g., ImageNet). Codes are available at this https URL.
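A minimal sketch of the patch-based tokenization that avoids downsampling: the masked image is split into non-overlapping patches, each projected to one feature token, and a patch is flagged as masked if any of its pixels is missing, so the transformer only needs to predict those positions. The toy linear projection below is a stand-in for the P-VQVAE encoder; the patch size and dimensions are illustrative.

```python
# Sketch: non-overlapping patch tokens for a masked image (toy stand-in for the
# P-VQVAE encoder). Patch size and the projection layer are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchTokenizer(nn.Module):
    def __init__(self, patch=16, in_ch=3, dim=256):
        super().__init__()
        # A stride=kernel conv produces one token per non-overlapping patch.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.patch = patch

    def forward(self, image, mask):
        # image: (B, 3, H, W); mask: (B, 1, H, W) float, 1 marks missing pixels.
        tokens = self.proj(image * (1 - mask))            # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)        # (B, N, dim)
        # A patch is "masked" if any of its pixels is missing.
        patch_masked = F.max_pool2d(mask, self.patch).flatten(1) > 0
        return tokens, patch_masked                       # the transformer predicts the masked ones
```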
https://arxiv.org/abs/2404.00513
Denoising diffusion probabilistic models for image inpainting add noise to the texture of an image during the forward process and recover the masked regions from the unmasked texture via the reverse denoising process. Despite generating meaningful semantics, existing methods suffer from a semantic discrepancy between masked and unmasked regions: the semantically dense unmasked texture is never fully degraded, while the masked regions turn into pure noise during diffusion, leading to a large discrepancy between them. In this paper, we aim to answer how unmasked semantics can guide the texture denoising process, and how to tackle the semantic discrepancy so as to facilitate consistent and meaningful semantics. To this end, we propose a novel structure-guided diffusion model named StrDiffusion, which reformulates the conventional texture denoising process under structure guidance to derive a simplified denoising objective for image inpainting, while revealing that: 1) the semantically sparse structure is beneficial for tackling the semantic discrepancy in the early stage, while the dense texture generates reasonable semantics in the late stage; 2) the semantics from unmasked regions essentially offer time-dependent structure guidance for the texture denoising process, benefiting from the time-dependent sparsity of the structure semantics. For the denoising process, a structure-guided neural network is trained to estimate the simplified denoising objective by exploiting the consistency of the denoised structure between masked and unmasked regions. In addition, we devise an adaptive resampling strategy as a formal criterion for whether the structure is competent to guide the texture denoising process, while regulating their semantic correlations. Extensive experiments validate the merits of StrDiffusion over the state of the art. Our code is available at this https URL.
https://arxiv.org/abs/2403.19898
Prior studies have made significant progress in image inpainting guided by either text or a subject image. However, research on editing with their combined guidance is still in the early stages. To tackle this challenge, we present LAR-Gen, a novel approach for image inpainting that enables seamless inpainting of masked scene images guided by both textual prompts and specified subjects. Our approach adopts a coarse-to-fine manner to ensure subject identity preservation and local semantic coherence. The process involves (i) Locate: concatenating the noise with the masked scene image to achieve precise regional editing, (ii) Assign: employing a decoupled cross-attention mechanism to accommodate multi-modal guidance, and (iii) Refine: using a novel RefineNet to supplement subject details. Additionally, to address the issue of scarce training data, we introduce a novel data construction pipeline. This pipeline extracts substantial pairs of data consisting of local text prompts and corresponding visual instances from a vast image dataset, leveraging publicly available large models. Extensive experiments and varied application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency. The project page can be found at \url{this https URL}.
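The "Assign" step's decoupled cross-attention can be illustrated as two attention passes over the same queries, one against text tokens and one against subject-image tokens, whose outputs are summed. This is a hedged single-head sketch in the spirit of decoupled cross-attention; the dimensions, layer names, and `img_scale` knob are illustrative, not LAR-Gen's exact layer.

```python
# Sketch: decoupled cross-attention accommodating text and subject-image guidance
# with separate key/value projections. Dimensions are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim=320, txt_dim=768, img_dim=1024):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k_txt, self.v_txt = nn.Linear(txt_dim, dim), nn.Linear(txt_dim, dim)
        self.k_img, self.v_img = nn.Linear(img_dim, dim), nn.Linear(img_dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, txt_tokens, img_tokens, img_scale=1.0):
        q = self.q(x)                                                       # (B, N, dim)
        txt_out = F.scaled_dot_product_attention(q, self.k_txt(txt_tokens), self.v_txt(txt_tokens))
        img_out = F.scaled_dot_product_attention(q, self.k_img(img_tokens), self.v_img(img_tokens))
        return self.out(txt_out + img_scale * img_out)                     # sum of the two branches
```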
https://arxiv.org/abs/2403.19534
Image stitching from different captures often results in non-rectangular boundaries, which are often considered unappealing. To deal with non-rectangular boundaries, current solutions involve cropping, which discards image content; inpainting, which can introduce unrelated content; or warping, which can distort non-linear features and introduce artifacts. To overcome these issues, we introduce a novel diffusion-based learning framework, \textbf{RecDiffusion}, for image stitching rectangling. This framework combines Motion Diffusion Models (MDM) to generate motion fields, effectively transitioning from the stitched image's irregular borders to a geometrically corrected intermediary, followed by Content Diffusion Models (CDM) for image detail refinement. Notably, our sampling process utilizes a weighted map to identify regions needing correction during each iteration of CDM. Our RecDiffusion ensures geometric accuracy and overall visual appeal, surpassing all previous methods in both quantitative and qualitative measures when evaluated on public benchmarks. Code is released at this https URL.
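The motion-field idea, moving pixels from the stitched image's irregular border toward a rectangular frame, amounts to warping with a per-pixel displacement field, which can be sketched with `grid_sample`. The MDM that would predict the field is not shown; the function and flow convention below are illustrative.

```python
# Sketch: applying a predicted motion field (per-pixel displacement) to an image,
# as one step of turning an irregular stitched border into a rectangular one.
# The motion field itself would come from the MDM; here it is just an input.
import torch
import torch.nn.functional as F

def warp_with_motion_field(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # image: (B, C, H, W); flow: (B, 2, H, W) giving (dx, dy) displacements in pixels.
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(image.device)      # (H, W, 2), (x, y) order
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)                 # displaced sampling positions
    # Normalize to [-1, 1] as expected by grid_sample.
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(image, grid, align_corners=True)
```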
https://arxiv.org/abs/2403.19164
Recent thrilling progress in large-scale text-to-image (T2I) models has unlocked unprecedented synthesis quality of AI-generated content (AIGC) including image generation, 3D and video composition. Further, personalized techniques enable appealing customized production of a novel concept given only several images as reference. However, an intriguing problem persists: Is it possible to capture multiple, novel concepts from one single reference image? In this paper, we identify that existing approaches fail to preserve visual consistency with the reference image and eliminate cross-influence from concepts. To alleviate this, we propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically, we first introduce new learnable modifiers bound with classes to capture attributes of multiple concepts. Then, the classes are separated and strengthened following the activation of the cross-attention operation, ensuring comprehensive and self-contained concepts. Additionally, we suppress the attention activation of different classes to mitigate mutual influence among concepts. Together, our proposed method, dubbed DisenDiff, can learn disentangled multiple concepts from one single image and produce novel customized images with learned concepts. We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations. More importantly, our proposed techniques are compatible with LoRA and inpainting pipelines, enabling more interactive experiences.
https://arxiv.org/abs/2403.18551
In recent years, remarkable advancements have been achieved in the field of image generation, primarily driven by the escalating demand for high-quality outcomes across various image generation subtasks, such as inpainting, denoising, and super resolution. A major effort is devoted to exploring the application of super-resolution techniques to enhance the quality of low-resolution images. In this context, our method explores in depth the problem of ship image super resolution, which is crucial for coastal and port surveillance. We investigate the opportunity offered by the growing interest in text-to-image diffusion models, taking advantage of the prior knowledge that such foundation models have already learned. In particular, we present a diffusion-model-based architecture that leverages text conditioning during training while being class-aware, to best preserve the crucial details of the ships during the generation of the super-resolved image. Given the specificity of this task and the scarce availability of off-the-shelf data, we also introduce a large labeled ship dataset scraped from online ship images, mostly from the ShipSpotting\footnote{\url{this http URL}} website. Our method achieves more robust results than other deep learning models previously employed for super resolution, as proven by the multiple experiments performed. Moreover, we investigate how this model can benefit downstream tasks, such as classification and object detection, thus emphasizing practical implementation in a real-world scenario. Experimental results show the flexibility, reliability, and impressive performance of the proposed framework over state-of-the-art methods for different tasks. The code is available at: this https URL.
https://arxiv.org/abs/2403.18370
The quality of images captured outdoors is often affected by the weather. One factor that interferes with sight is rain, which can obstruct the view of observers and of computer vision applications that rely on those images. This work aims to restore rain images by removing rain streaks via Self-supervised Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain-streak pixels in the input rain image via dictionary learning and use pixel-wise RL agents to take multiple inpainting actions to progressively remove the rain. To our knowledge, this work is the first attempt to apply self-supervised RL to image deraining. Experimental results on several benchmark image-deraining datasets show that the proposed SRL-Derain performs favorably against state-of-the-art few-shot and self-supervised deraining and denoising methods.
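A hedged sketch of the pixel-wise agent idea: each pixel flagged as a rain streak selects one of a few local inpainting actions (e.g., keep its value, take a median-filtered value, copy a neighbor), and the chosen actions are applied in parallel at every step. The action set below is an illustrative stand-in; the policy network and dictionary-learning streak detector, which the paper relies on, are assumed to exist elsewhere.

```python
# Sketch: applying per-pixel actions chosen by RL agents to remove rain streaks.
# The three actions are illustrative stand-ins, not the paper's action set.
import numpy as np
from scipy.ndimage import median_filter

def apply_pixel_actions(img, streak_mask, actions):
    # img: (H, W) float image; streak_mask: (H, W) bool; actions: (H, W) ints in {0, 1, 2}.
    out = img.copy()
    med = median_filter(img, size=3)
    shifted = np.roll(img, shift=1, axis=0)              # value of the pixel above
    out[streak_mask & (actions == 1)] = med[streak_mask & (actions == 1)]
    out[streak_mask & (actions == 2)] = shifted[streak_mask & (actions == 2)]
    # action 0: keep the current value (no-op); non-streak pixels are untouched.
    return out
```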
https://arxiv.org/abs/2403.18270
We present a method for large-mask pluralistic image inpainting based on the generative framework of discrete latent codes. Our method learns latent priors, discretized as tokens, by performing computations only at the visible locations of the image. This is realized by a restrictive partial encoder that predicts the token label for each visible block, a bidirectional transformer that infers the missing labels by looking only at these tokens, and a dedicated synthesis network that couples the tokens with the partial image priors to generate coherent and pluralistic complete images even under extreme mask settings. Experiments on public benchmarks validate our design choices, as the proposed method outperforms strong baselines in both visual quality and diversity metrics.
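Inference of the missing labels can be sketched as iterative parallel decoding with a bidirectional transformer: each round predicts all missing tokens, commits the most confident ones, and re-predicts the rest; sampling different seeds yields the pluralistic completions. The transformer interface and the confidence schedule below are placeholders, not the paper's exact sampler.

```python
# Sketch: iterative inference of missing token labels with a bidirectional
# transformer (a MaskGIT-style schedule as a stand-in for the paper's sampler).
import torch

@torch.no_grad()
def infer_missing_tokens(transformer, tokens, missing, mask_id, steps=8):
    # tokens: (B, N) int token labels; missing: (B, N) bool; mask_id: placeholder label.
    tokens = tokens.masked_fill(missing, mask_id)
    for step in range(steps):
        logits = transformer(tokens)                         # (B, N, vocab), bidirectional context
        conf, pred = logits.softmax(-1).max(-1)               # per-position confidence and label
        conf = conf.masked_fill(~missing, float("-inf"))      # only missing positions compete
        # Commit an even share of the remaining missing tokens each round.
        k = max(1, int(missing.sum(-1).max().item()) // (steps - step))
        keep = conf.topk(k, dim=-1).indices
        commit = torch.zeros_like(missing)
        commit.scatter_(1, keep, torch.ones_like(keep, dtype=torch.bool))
        commit &= missing
        tokens = torch.where(commit, pred, tokens)
        missing = missing & ~commit
    return tokens
```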
https://arxiv.org/abs/2403.18186
Diffusion models (DMs) are capable of generating remarkably high-quality samples by iteratively denoising a random vector, a process that corresponds to moving along the probability flow ordinary differential equation (PF ODE). Interestingly, DMs can also invert an input image to noise by moving backward along the PF ODE, a key operation for downstream tasks such as interpolation and image editing. However, the iterative nature of this process restricts its speed, hindering its broader application. Recently, Consistency Models (CMs) have emerged to address this challenge by approximating the integral of the PF ODE, thereby bypassing the need to iterate. Yet, the absence of an explicit ODE solver complicates the inversion process. To resolve this, we introduce the Bidirectional Consistency Model (BCM), which learns a single neural network that enables both forward and backward traversal along the PF ODE, efficiently unifying generation and inversion tasks within one framework. Notably, our proposed method enables one-step generation and inversion while also allowing the use of additional steps to enhance generation quality or reduce reconstruction error. Furthermore, by leveraging our model's bidirectional consistency, we introduce a sampling strategy that can enhance FID while preserving the generated image content. We further showcase our model's capabilities in several downstream tasks, such as interpolation and inpainting, and present demonstrations of potential applications, including blind restoration of compressed images and defending against black-box adversarial attacks.
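The bidirectional idea can be sketched as a single network f(x, t, u) trained to map a point on the PF ODE trajectory from time t to time u in either direction: generation traverses from the noise level T down to 0, inversion from 0 up to T, and extra steps refine either result. The `net` interface, time range, and noise initialization below are placeholders, not the paper's parameterization.

```python
# Sketch: one network traversing the PF ODE in both directions.
# `net(x, t, u)` is a placeholder that returns the trajectory state at time u.
import torch

@torch.no_grad()
def bcm_generate(net, shape, T=80.0, steps=1, device="cpu"):
    x = torch.randn(shape, device=device) * T            # start from noise at t = T
    times = torch.linspace(T, 0.0, steps + 1, device=device)
    for t, u in zip(times[:-1], times[1:]):
        x = net(x, t, u)                                  # forward traversal: denoise t -> u
    return x

@torch.no_grad()
def bcm_invert(net, image, T=80.0, steps=1):
    x = image
    times = torch.linspace(0.0, T, steps + 1, device=image.device)
    for t, u in zip(times[:-1], times[1:]):
        x = net(x, t, u)                                  # backward traversal: 0 -> T
    return x                                              # latent noise encoding the image
```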
https://arxiv.org/abs/2403.18035
We present GenesisTex, a novel method for synthesizing textures for 3D geometries from text descriptions. GenesisTex adapts the pretrained image diffusion model to texture space by texture space sampling. Specifically, we maintain a latent texture map for each viewpoint, which is updated with predicted noise on the rendering of the corresponding viewpoint. The sampled latent texture maps are then decoded into a final texture map. During the sampling process, we focus on both global and local consistency across multiple viewpoints: global consistency is achieved through the integration of style consistency mechanisms within the noise prediction network, and low-level consistency is achieved by dynamically aligning latent textures. Finally, we apply reference-based inpainting and img2img on denser views for texture refinement. Our approach overcomes the limitations of slow optimization in distillation-based methods and instability in inpainting-based methods. Experiments on meshes from various sources demonstrate that our method surpasses the baseline methods quantitatively and qualitatively.
https://arxiv.org/abs/2403.17782
Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. It generates intermediate samples with degraded structure by substituting selected self-attention maps in the diffusion U-Net with an identity matrix, exploiting the self-attention mechanism's ability to capture structural information, and then guides the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration tasks such as inpainting and deblurring.
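Because an identity self-attention map makes each position attend only to itself, the perturbed branch can be implemented by returning the value projection directly, and the guided prediction then extrapolates away from this structure-degraded output. A hedged sketch follows; `denoise(x, t, attn_hook=...)` is an assumed interface standing in for a diffusion U-Net with hookable self-attention, not a real library call.

```python
# Sketch: Perturbed-Attention Guidance. The perturbed pass replaces selected
# self-attention maps with the identity, i.e. each query just returns its own
# value; the final prediction is pushed away from that degraded output.
# `denoise(x, t, attn_hook=...)` is an assumed interface, not a real API.
import torch

def identity_self_attention(q, k, v):
    # With an identity attention map, softmax(QK^T)V reduces to V itself.
    return v

def pag_step(denoise, x, t, scale=3.0):
    eps_normal = denoise(x, t)                                    # ordinary self-attention pass
    eps_perturbed = denoise(x, t, attn_hook=identity_self_attention)
    return eps_normal + scale * (eps_normal - eps_perturbed)      # guide away from degraded sample
```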
https://arxiv.org/abs/2403.17377
Video inpainting tasks have seen significant improvements in recent years with the rise of deep neural networks and, in particular, vision transformers. Although these models show promising reconstruction quality and temporal consistency, they are still unsuitable for live videos, one of the last steps toward making them fully convincing and usable. The main limitations are that these state-of-the-art models inpaint using the whole video (offline processing) and run at an insufficient frame rate. In our approach, we propose a framework to adapt existing inpainting transformers to these constraints by memorizing and refining redundant computations while maintaining decent inpainting quality. Using this framework with some of the most recent inpainting models, we obtain strong online results with a consistent throughput above 20 frames per second. The code and pretrained models will be made available upon acceptance.
https://arxiv.org/abs/2403.16161
Image inpainting is the process of taking an image and reconstructing lost or intentionally occluded portions. Inpainting has countless applications, including restoring previously damaged pictures, restoring the quality of images degraded by compression, and removing unwanted objects and text. Modern inpainting techniques have shown remarkable ability in generating sensible completions for images with mask occlusions. In our paper, we provide an overview of the progress of inpainting techniques, identify the current leading approaches, and examine their strengths and weaknesses. We then address a critical gap in these existing models: the ability to prompt and control what exactly is generated. We additionally justify why we think this is the natural next progressive step that inpainting models must take, and provide multiple approaches to implementing this functionality. Finally, we evaluate the results of our approaches by qualitatively checking whether they generate high-quality images that correctly inpaint regions with the objects they are instructed to produce.
https://arxiv.org/abs/2403.16016
This paper proposes a mask optimization method for improving the quality of object removal using image inpainting. While many inpainting methods are trained with a set of random masks, a target for inpainting may be an object, such as a person, in many realistic scenarios. This domain gap between masks in training and inference images increases the difficulty of the inpainting task. In our method, this domain gap is resolved by training the inpainting network with object masks extracted by segmentation, and such object masks are also used in the inference step. Furthermore, to optimize the object masks for inpainting, the segmentation network is connected to the inpainting network and end-to-end trained to improve the inpainting performance. The effect of this end-to-end training is further enhanced by our mask expansion loss for achieving the trade-off between large and small masks. Experimental results demonstrate the effectiveness of our method for better object removal using image inpainting.
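The trade-off between large and small masks can be illustrated with a simple two-term loss: the predicted mask must cover the segmented object (expansion term) while its total area is penalized (compactness term). The weighting and exact form below are illustrative, not the paper's mask expansion loss.

```python
# Sketch: a loss balancing coverage of the object to remove against mask area.
# `pred_mask` is the soft mask fed to the inpainting network, `obj_mask` the
# object segmentation; `area_weight` is an illustrative hyperparameter.
import torch

def mask_expansion_loss(pred_mask, obj_mask, area_weight=0.1):
    coverage = (obj_mask * (1 - pred_mask)).mean()   # penalize object pixels left uncovered
    area = pred_mask.mean()                           # penalize overly large masks
    return coverage + area_weight * area
```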
https://arxiv.org/abs/2403.15849
We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantically, spatially, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc., with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher-quality edits than 6 baselines on 2 editing benchmarks using 10 evaluation metrics.
https://arxiv.org/abs/2403.14617