This paper proposes a mask optimization method for improving the quality of object removal using image inpainting. While many inpainting methods are trained with a set of random masks, the target for inpainting in many realistic scenarios is an object, such as a person. This domain gap between the masks used in training and those encountered at inference increases the difficulty of the inpainting task. In our method, this domain gap is resolved by training the inpainting network with object masks extracted by segmentation, and such object masks are also used in the inference step. Furthermore, to optimize the object masks for inpainting, the segmentation network is connected to the inpainting network and trained end-to-end to improve the inpainting performance. The effect of this end-to-end training is further enhanced by our mask expansion loss, which balances the trade-off between large and small masks. Experimental results demonstrate the effectiveness of our method for better object removal using image inpainting.
https://arxiv.org/abs/2403.15849
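The mask expansion loss above is only characterized as a trade-off between large and small masks; a minimal sketch of one plausible form is given below, combining a coverage term (the predicted mask should still contain the object) with an area penalty (the mask should not grow without bound). The formulation, the PyTorch interface, and the weight `lambda_area` are illustrative assumptions, not the authors' definition.

```python
import torch

def mask_expansion_loss(pred_mask, object_mask, lambda_area=0.1):
    """Hypothetical trade-off loss between large and small inpainting masks.

    pred_mask:   (B, 1, H, W) soft mask produced by the segmentation network.
    object_mask: (B, 1, H, W) binary mask of the object to be removed.
    """
    # Coverage term: penalize object pixels that fall outside the predicted mask,
    # which would leave visible object remnants for the inpainting network.
    coverage = ((1.0 - pred_mask) * object_mask).mean()
    # Area term: penalize overly large masks that erase useful surrounding context.
    area = pred_mask.mean()
    return coverage + lambda_area * area
```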
We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantically, spatially, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc., with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher-quality edits than 6 baselines on 2 editing benchmarks using 10 evaluation metrics.
https://arxiv.org/abs/2403.14617
Monitoring diseases that affect the brain's structural integrity requires automated analysis of magnetic resonance (MR) images, e.g., for the evaluation of volumetric changes. However, many of the evaluation tools are optimized for analyzing healthy tissue. To enable the evaluation of scans containing pathological tissue, it is therefore required to restore healthy tissue in the pathological areas. In this work, we explore and extend denoising diffusion models for consistent inpainting of healthy 3D brain tissue. We modify state-of-the-art 2D, pseudo-3D, and 3D methods working in the image space, as well as 3D latent and 3D wavelet diffusion models, and train them to synthesize healthy brain tissue. Our evaluation shows that the pseudo-3D model performs best regarding the structural-similarity index, peak signal-to-noise ratio, and mean squared error. To emphasize the clinical relevance, we fine-tune this model on data containing synthetic MS lesions and evaluate it on a downstream brain tissue segmentation task, whereby it outperforms the established FMRIB Software Library (FSL) lesion-filling method.
https://arxiv.org/abs/2403.14499
Recently, how to achieve precise image editing has attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities into one framework, we adopt the concept of layers from the design domain to manipulate objects flexibly with various operations. The key insight is to transform the spatial-aware image editing task into a combination of two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion. First, we segment the latent representations of the source images into multiple layers, which include several object layers and one incomplete background layer that necessitates reliable inpainting. To avoid extra tuning, we further explore the inherent inpainting ability within the self-attention mechanism. We introduce a key-masking self-attention scheme that can propagate the surrounding context information into the masked region while mitigating its impact on the regions outside the mask. Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent. We also introduce an artifact suppression scheme in the latent space to enhance the inpainting quality. Due to the inherent modular advantages of such multi-layered representations, we can achieve accurate image editing, and we demonstrate that our approach consistently surpasses the latest spatial editing methods, including Self-Guidance and DiffEditor. Last, we show that our approach is a unified framework supporting accurate image editing across more than six different editing tasks.
https://arxiv.org/abs/2403.14487
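The key-masking self-attention scheme above can be illustrated under the assumption that "key masking" means removing tokens inside the inpainting hole from the key/value set: every query, inside or outside the hole, attends only to surrounding context, so context propagates into the masked region without the hole leaking out. The shapes, the single-head formulation, and the bias value are assumptions for illustration.

```python
import torch

def key_masked_self_attention(q, k, v, hole_mask):
    """q, k, v: (B, N, D) token features; hole_mask: (B, N) bool, True inside the masked region.

    Keys/values belonging to the hole are suppressed, so surrounding context flows
    into the masked region while the hole cannot influence tokens outside it.
    (Assumes at least one token lies outside the hole.)
    """
    d = q.shape[-1]
    attn = (q @ k.transpose(-2, -1)) / d ** 0.5                    # (B, N, N) attention logits
    attn = attn.masked_fill(hole_mask[:, None, :], float("-inf"))  # block keys inside the hole
    attn = attn.softmax(dim=-1)
    return attn @ v
```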
Inpainting, for filling missing image regions, is a crucial task in various applications, such as medical imaging and remote sensing. Data-driven approaches, while efficient for image inpainting, often require extensive data preprocessing. There is therefore still a need for model-driven approaches when applications are constrained by data availability and quality, especially those related to time series forecasting using image inpainting techniques. This paper proposes an improved model-driven approach relying on patch-based techniques. Our approach deviates from the standard Sum of Squared Differences (SSD) similarity measure by introducing a Hybrid Similarity (HySim), which combines the strengths of both the Chebyshev and Minkowski distances. This hybridization enhances patch selection, leading to high-quality inpainting results with reduced mismatch errors. Experimental results demonstrate the effectiveness of our approach against other model-driven techniques, such as diffusion- or patch-based approaches, showcasing its ability to achieve visually pleasing restorations.
https://arxiv.org/abs/2403.14292
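The abstract states that HySim blends Chebyshev and Minkowski distances but not how; the sketch below uses a simple weighted combination, so the weight `alpha` and the order `p` are illustrative assumptions rather than the paper's formula.

```python
import numpy as np

def hysim_distance(patch_a, patch_b, p=3, alpha=0.5):
    """Hypothetical hybrid patch distance blending Chebyshev and Minkowski terms.

    Lower values indicate better-matching candidate patches for exemplar-based inpainting.
    """
    diff = np.abs(patch_a.astype(np.float64) - patch_b.astype(np.float64))
    chebyshev = diff.max()                      # worst-case per-pixel deviation
    minkowski = (diff ** p).sum() ** (1.0 / p)  # order-p aggregate deviation
    return alpha * chebyshev + (1.0 - alpha) * minkowski

def best_matching_patch(target, candidates, **kwargs):
    """Index of the candidate patch most similar to the target under HySim."""
    return int(np.argmin([hysim_distance(target, c, **kwargs) for c in candidates]))
```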
Data generated in clinical practice often exhibits biases, such as long-tail imbalance and algorithmic unfairness. This study aims to mitigate these challenges through data synthesis. Previous efforts in medical imaging synthesis have struggled with separating lesion information from background context, leading to difficulties in generating high-quality backgrounds and limited control over the synthetic output. Inspired by diffusion-based image inpainting, we propose LeFusion, lesion-focused diffusion models. By redesigning the diffusion learning objectives to concentrate on lesion areas, it simplifies the model learning process and enhances the controllability of the synthetic output, while preserving the background by integrating forward-diffused background contexts into the reverse diffusion process. Furthermore, we generalize it to jointly handle multi-class lesions, and further introduce a generative model for lesion masks to increase synthesis diversity. Validated on the DE-MRI cardiac lesion segmentation dataset (Emidec), our methodology employs the popular nnUNet to demonstrate that the synthetic data make it possible to effectively enhance a state-of-the-art model. Code and model are available at this https URL.
https://arxiv.org/abs/2403.14066
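A rough sketch of the two ideas described above, restricting the denoising objective to the lesion mask and re-injecting a forward-diffused background at each reverse step, is given below; the `forward_diffuse` callable and the compositing rule are placeholders assumed for illustration, not LeFusion's actual implementation.

```python
import torch

def lesion_focused_loss(eps_pred, eps_true, lesion_mask):
    """Denoising loss restricted to the lesion region (lesion_mask: (B, 1, H, W) in {0, 1})."""
    masked_error = ((eps_pred - eps_true) ** 2 * lesion_mask).sum()
    return masked_error / lesion_mask.sum().clamp(min=1.0)

def paste_background(x_t, background_x0, lesion_mask, forward_diffuse, t):
    """Keep the background on the forward-diffused trajectory of the real image so that
    only the lesion region is actually synthesized during reverse diffusion."""
    bg_t = forward_diffuse(background_x0, t)  # placeholder: a sample from q(x_t | x_0)
    return lesion_mask * x_t + (1.0 - lesion_mask) * bg_t
```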
Implicit neural representations (INR) excel in encoding videos within neural networks, showcasing promise in computer vision tasks like video compression and denoising. INR-based approaches reconstruct video frames from content-agnostic embeddings, which hampers their efficacy in video frame regression and restricts their generalization ability for video interpolation. To address these deficiencies, Hybrid Neural Representation for Videos (HNeRV) was introduced with content-adaptive embeddings. Nevertheless, HNeRV's compression ratios remain relatively low, attributable to an oversight in leveraging the network's shallow features and inter-frame residual information. In this work, we introduce an advanced U-shaped architecture, Vector Quantized-NeRV (VQ-NeRV), which integrates a novel component, the VQ-NeRV Block. This block incorporates a codebook mechanism to effectively discretize the network's shallow residual features and inter-frame residual information. This approach proves particularly advantageous in video compression, as it results in a smaller size than quantized features. Furthermore, we introduce an original codebook optimization technique, termed shallow codebook optimization, designed to refine the utility and efficiency of the codebook. The experimental evaluations indicate that VQ-NeRV outperforms HNeRV on video regression tasks, delivering superior reconstruction quality (with an increase of 1-2 dB in Peak Signal-to-Noise Ratio (PSNR)), better bits-per-pixel (bpp) efficiency, and improved video inpainting outcomes.
https://arxiv.org/abs/2403.12401
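The VQ-NeRV block's codebook mechanism for discretizing residual features can be sketched as a standard vector-quantization lookup; the codebook size, dimensionality, and straight-through gradient trick below are common VQ choices assumed here, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ResidualCodebook(nn.Module):
    """Quantize residual feature vectors to their nearest codebook entries."""

    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, residual):                              # residual: (N, dim)
        dists = torch.cdist(residual, self.codebook.weight)   # (N, num_codes)
        idx = dists.argmin(dim=1)                             # discrete indices to store/transmit
        quantized = self.codebook(idx)
        # Straight-through estimator: copy gradients back to the continuous residual branch.
        quantized = residual + (quantized - residual).detach()
        return quantized, idx
```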
Face inpainting, the technique of restoring missing or damaged regions in facial images, is pivotal for applications like face recognition in occluded scenarios and image analysis with poor-quality captures. This process not only needs to produce realistic visuals but also preserve individual identity characteristics. The aim of this paper is to inpaint a face given the periocular region (eyes-to-face) through a proposed new Generative Adversarial Network (GAN)-based model called Eyes-to-Face Network (E2F-Net). The proposed approach extracts identity and non-identity features from the periocular region using two dedicated encoders. The extracted features are then mapped to the latent space of a pre-trained StyleGAN generator to benefit from its state-of-the-art performance and its rich, diverse and expressive latent space without any additional training. We further improve the StyleGAN output to find the optimal code in the latent space using a new GAN-inversion optimization technique. Our E2F-Net requires minimal training, reducing computational complexity as a secondary benefit. Through extensive experiments, we show that our method successfully reconstructs the whole face with high quality, surpassing current techniques, despite requiring significantly less training and supervision. We have generated seven eyes-to-face datasets based on well-known public face datasets for training and verifying our proposed methods. The code and datasets are publicly available.
https://arxiv.org/abs/2403.12197
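The "GAN-inversion optimization" mentioned above is sketched below as a generic latent-refinement loop: starting from an encoder-predicted code, the latent is optimized so that a frozen StyleGAN generator reproduces the known periocular pixels while the prior fills in the rest. The loss, step count, and `generator` interface are assumptions for illustration.

```python
import torch

def refine_latent(generator, w_init, periocular_img, region_mask, steps=200, lr=0.01):
    """Refine a latent code so the frozen generator reproduces the known periocular
    pixels; the remainder of the face is filled in by the StyleGAN prior."""
    w = w_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        recon = generator(w)                                  # frozen, pre-trained generator
        loss = ((recon - periocular_img) ** 2 * region_mask).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w.detach()
```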
Recent advancements in video generation have been remarkable, yet many existing methods struggle with issues of consistency and poor text-video alignment. Moreover, the field lacks effective techniques for text-guided video inpainting, a stark contrast to the well-explored domain of text-guided image inpainting. To this end, this paper proposes a novel text-guided video inpainting model that achieves better consistency, controllability and compatibility. Specifically, we introduce a simple but efficient motion capture module to preserve motion consistency, design an instance-aware region selection instead of a random region selection to obtain better textual controllability, and utilize a novel strategy to inject personalized models into our CoCoCo model, thus obtaining better model compatibility. Extensive experiments show that our model can generate high-quality video clips. Meanwhile, our model shows better motion consistency, textual controllability and model compatibility. More details are shown in [this http URL](this http URL).
https://arxiv.org/abs/2403.12035
Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD), aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to its reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution multi-aspect-ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD's effectiveness in various applications such as image editing and inpainting.
https://arxiv.org/abs/2403.12015
Text-to-texture synthesis has become a new frontier in 3D content creation thanks to the recent advances in text-to-image models. Existing methods primarily adopt a combination of pretrained depth-aware diffusion and inpainting models, yet they exhibit shortcomings such as 3D inconsistency and limited controllability. To address these challenges, we introduce InteX, a novel framework for interactive text-to-texture synthesis. 1) InteX includes a user-friendly interface that facilitates interaction and control throughout the synthesis process, enabling region-specific repainting and precise texture editing. 2) Additionally, we develop a unified depth-aware inpainting model that integrates depth information with inpainting cues, effectively mitigating 3D inconsistencies and improving generation speed. Through extensive experiments, our framework has proven to be both practical and effective in text-to-texture synthesis, paving the way for high-quality 3D content creation.
https://arxiv.org/abs/2403.11878
Solving image inverse problems (e.g., super-resolution and inpainting) requires generating a high-fidelity image that matches the given input (the low-resolution image or the masked image). By using the input image as guidance, we can leverage a pretrained diffusion generative model to solve a wide range of image inverse tasks without task-specific model fine-tuning. To precisely estimate the guidance score function of the input image, we propose Diffusion Policy Gradient (DPG), a tractable computation method that views the intermediate noisy images as policies and the target image as the states selected by the policy. Experiments show that our method is robust to both Gaussian and Poisson noise degradation on multiple linear and non-linear inverse tasks, resulting in higher image restoration quality on the FFHQ, ImageNet and LSUN datasets.
https://arxiv.org/abs/2403.10585
Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods heavily rely on the outputs of existing models, leading to error accumulation in geometry and appearance that prevents the models from being used in various scenarios (e.g., outdoor and unreal scenarios). To address this limitation, we generatively refine the newly generated local views by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane feature-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency, and propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior from a 2D diffusion model as well as the global 3D information of the current scene. Our extensive experiments demonstrate that, in comparison to previous methods, our approach supports a wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
https://arxiv.org/abs/2403.09439
Referring object removal refers to removing the specific object in an image referred to by natural language expressions and filling the missing region with reasonable semantics. To address this task, we construct ComCOCO, a synthetic dataset consisting of 136,495 referring expressions for 34,615 objects in 23,951 image pairs. Each pair contains an image with referring expressions and the ground truth after elimination. We further propose an end-to-end syntax-aware hybrid mapping network with an encoding-decoding structure. Linguistic features are hierarchically extracted at the syntactic level and fused into the downsampling process of visual features with multi-head attention. The feature-aligned pyramid network is leveraged to generate segmentation masks and replace internal pixels with region affinity learned from external semantics in high-level feature maps. Extensive experiments demonstrate that our model outperforms, by a significant margin, diffusion models and two-stage methods that process the segmentation and inpainting tasks separately.
https://arxiv.org/abs/2403.09128
We provide a framework for solving inverse problems with diffusion models learned from linearly corrupted data. Our method, Ambient Diffusion Posterior Sampling (A-DPS), leverages a generative model pre-trained on one type of corruption (e.g. image inpainting) to perform posterior sampling conditioned on measurements from a potentially different forward process (e.g. image blurring). We test the efficacy of our approach on standard natural image datasets (CelebA, FFHQ, and AFHQ) and we show that A-DPS can sometimes outperform models trained on clean data for several image restoration tasks in both speed and performance. We further extend the Ambient Diffusion framework to train MRI models with access only to Fourier subsampled multi-coil MRI measurements at various acceleration factors (R=2, 4, 6, 8). We again observe that models trained on highly subsampled data are better priors for solving inverse problems in the high acceleration regime than models trained on fully sampled data. We open-source our code and the trained Ambient Diffusion MRI models: this https URL .
https://arxiv.org/abs/2403.08728
Image-based virtual try-on aims to transfer target in-shop clothing to a dressed model image; the objectives are to completely remove the original clothing while preserving the content outside the try-on area, to wear the target clothing naturally, and to correctly inpaint the gap between the target clothing and the original clothing. Tremendous efforts have been made in this popular research area, but existing methods cannot preserve the type of the target clothing when the try-on area is affected by the original clothing. In this paper, we focus on the unpaired virtual try-on situation where the target clothing and the original clothing on the model are different, i.e., the practical scenario. To break the correlation between the try-on area and the original clothing and make the model learn the correct information to inpaint, we propose an adaptive mask training paradigm that dynamically adjusts training masks. It not only improves the alignment and fit of clothing but also significantly enhances the fidelity of the virtual try-on experience. Furthermore, we propose, for the first time, two metrics for unpaired try-on evaluation, the Semantic-Densepose-Ratio (SDR) and Skeleton-LPIPS (S-LPIPS), to evaluate the correctness of clothing type and the accuracy of clothing texture. For unpaired try-on validation, we construct a comprehensive cross-try-on benchmark (Cross-27) with distinctive clothing items and model physiques, covering a broad range of try-on scenarios. Experiments demonstrate the effectiveness of the proposed methods, contributing to the advancement of virtual try-on technology and offering new insights and tools for future research in the field. The code, model, and benchmark will be publicly released.
https://arxiv.org/abs/2403.08453
Existing generative adversarial network (GAN) based conditional image generative models typically produce a fixed output for the same conditional input, which is unreasonable for highly subjective tasks, such as large-mask image inpainting or style transfer. On the other hand, GAN-based diverse image generative methods require retraining/fine-tuning the network or designing complex noise injection functions, which is computationally expensive, task-specific, or struggles to generate high-quality results. Given that many deterministic conditional image generative models have been able to produce high-quality yet fixed results, we raise an intriguing question: is it possible for pre-trained deterministic conditional image generative models to generate diverse results without changing network structures or parameters? To answer this question, we re-examine the conditional image generation tasks from the perspective of adversarial attack and propose a simple and efficient plug-in projected gradient descent (PGD)-like method for diverse and controllable image generation. The key idea is attacking the pre-trained deterministic generative models by adding a micro perturbation to the input condition. In this way, diverse results can be generated without any adjustment of network structures or fine-tuning of the pre-trained models. In addition, we can also control the diverse results to be generated by specifying the attack direction according to a reference text or image. Our work opens the door to applying adversarial attack to low-level vision tasks, and experiments on various conditional image generation tasks demonstrate the effectiveness and superiority of the proposed method.
https://arxiv.org/abs/2403.08294
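A sketch of the plug-in PGD-like attack described above: a small perturbation of the conditional input is optimized against a frozen generator and clipped to an epsilon-ball, with the attack direction here chosen to pull the output toward a reference in some feature space. The objective, the `features` extractor, and the hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def perturb_condition(generator, features, cond, reference, eps=0.03, step=0.01, iters=10):
    """PGD-like micro-perturbation of the condition fed to a frozen deterministic generator."""
    delta = torch.zeros_like(cond, requires_grad=True)
    for _ in range(iters):
        out = generator(cond + delta)
        # Attack direction: pull the output toward the reference in feature space.
        loss = F.mse_loss(features(out), features(reference))
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()  # signed gradient step
            delta.clamp_(-eps, eps)            # stay within the epsilon-ball around the condition
            delta.grad = None
    return (cond + delta).detach()
```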
In this work, we investigate the potential of a large language model (LLM) to directly comprehend visual signals without the necessity of fine-tuning on multi-modal datasets. The foundational concept of our method views an image as a linguistic entity, and translates it to a set of discrete words derived from the LLM's vocabulary. To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this innovative image encoding, the LLM gains the ability not only for visual comprehension but also for image denoising and restoration in an auto-regressive fashion and, crucially, without any fine-tuning. We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration. Code and models are available at this https URL.
https://arxiv.org/abs/2403.07874
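The core idea of rendering an image in the LLM's own vocabulary can be illustrated as a nearest-neighbor lookup against the vocabulary embedding matrix; this sketch omits the encoder-decoder and CLIP components and simply assumes that patch features have already been projected into the embedding space.

```python
import torch
import torch.nn.functional as F

def image_to_vocab_tokens(patch_feats, vocab_embeddings):
    """Map each image patch feature to its nearest LLM vocabulary embedding.

    patch_feats:      (N, D) features already projected into the LLM embedding space.
    vocab_embeddings: (V, D) the frozen token embedding matrix of the LLM.
    Returns (N,) token IDs: the image rendered as a "foreign language" sentence.
    """
    sims = F.normalize(patch_feats, dim=-1) @ F.normalize(vocab_embeddings, dim=-1).T
    return sims.argmax(dim=-1)                 # cosine-nearest vocabulary token per patch
```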
We present "SemCity," a 3D diffusion model for semantic scene generation in real-world outdoor environments. Most 3D diffusion models focus on generating a single object, synthetic indoor scenes, or synthetic outdoor scenes, while the generation of real-world outdoor scenes is rarely addressed. In this paper, we concentrate on generating a real-outdoor scene through learning a diffusion model on a real-world outdoor dataset. In contrast to synthetic data, real-outdoor datasets often contain more empty spaces due to sensor limitations, causing challenges in learning real-outdoor distributions. To address this issue, we exploit a triplane representation as a proxy form of scene distributions to be learned by our diffusion model. Furthermore, we propose a triplane manipulation that integrates seamlessly with our triplane diffusion model. The manipulation improves our diffusion model's applicability in a variety of downstream tasks related to outdoor scene generation such as scene inpainting, scene outpainting, and semantic scene completion refinements. In experimental results, we demonstrate that our triplane diffusion model shows meaningful generation results compared with existing work in a real-outdoor dataset, SemanticKITTI. We also show our triplane manipulation facilitates seamlessly adding, removing, or modifying objects within a scene. Further, it also enables the expansion of scenes toward a city-level scale. Finally, we evaluate our method on semantic scene completion refinements where our diffusion model enhances predictions of semantic scene completion networks by learning scene distribution. Our code is available at this https URL.
https://arxiv.org/abs/2403.07773
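The triplane representation used above can be illustrated by a minimal query routine: a 3D point is projected onto the XY, XZ, and YZ planes, features are bilinearly sampled from each plane, and the three samples are summed. The tensor layout, axis ordering, and summation as the aggregation rule are assumptions, not SemCity's exact design.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, points):
    """planes: dict of (1, C, H, W) feature maps keyed by 'xy', 'xz', 'yz';
    points: (N, 3) coordinates normalized to [-1, 1]. Returns (N, C) features."""
    feats = 0.0
    for name, dims in (("xy", [0, 1]), ("xz", [0, 2]), ("yz", [1, 2])):
        grid = points[:, dims].view(1, -1, 1, 2)                         # (1, N, 1, 2) sampling grid
        sampled = F.grid_sample(planes[name], grid, align_corners=True)  # (1, C, N, 1)
        feats = feats + sampled[0, :, :, 0].T                            # accumulate (N, C)
    return feats
```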
Scene text recognition is an important and challenging task in computer vision. However, most prior works focus on recognizing pre-defined words, while there are various out-of-vocabulary (OOV) words in real-world applications. In this paper, we propose a novel open-vocabulary text recognition framework, Pseudo-OCR, to recognize OOV words. The key challenge in this task is the lack of OOV training data. To solve this problem, we first propose a pseudo label generation module that leverages character detection and image inpainting to produce substantial pseudo OOV training data from real-world images. Unlike previous synthetic data, our pseudo OOV data contains real characters and backgrounds to simulate real-world applications. Secondly, to reduce noises in pseudo data, we present a semantic checking mechanism to filter semantically meaningful data. Thirdly, we introduce a quality-aware margin loss to boost the training with pseudo data. Our loss includes a margin-based part to enhance the classification ability, and a quality-aware part to penalize low-quality samples in both real and pseudo data. Extensive experiments demonstrate that our approach outperforms the state-of-the-art on eight datasets and achieves the first rank in the ICDAR2022 challenge.
https://arxiv.org/abs/2403.07518