Recent thrilling progress in large-scale text-to-image (T2I) models has unlocked unprecedented synthesis quality of AI-generated content (AIGC), including image generation and 3D and video composition. Furthermore, personalized techniques enable appealing customized production of a novel concept given only several images as reference. However, an intriguing problem persists: Is it possible to capture multiple, novel concepts from one single reference image? In this paper, we identify that existing approaches fail to preserve visual consistency with the reference image and to eliminate cross-influence among concepts. To alleviate this, we propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically, we first introduce new learnable modifiers bound with classes to capture attributes of multiple concepts. Then, the classes are separated and strengthened following the activation of the cross-attention operation, ensuring comprehensive and self-contained concepts. Additionally, we suppress the attention activation of different classes to mitigate mutual influence among concepts. Together, our proposed method, dubbed DisenDiff, can learn disentangled multiple concepts from one single image and produce novel customized images with the learned concepts. We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations. More importantly, our proposed techniques are compatible with LoRA and inpainting pipelines, enabling more interactive experiences.
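Below is a minimal PyTorch sketch of the kind of cross-attention calibration DisenDiff describes: strengthening each class token's own attention map while suppressing spatial overlap between different class tokens. The tensor layout, the entropy-based strengthening term, and the loss weight are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def attention_calibration_loss(attn, class_idx, suppress_weight=1.0):
    """Illustrative calibration loss on cross-attention maps.

    attn:      [heads, pixels, tokens] softmax-normalized cross-attention.
    class_idx: token indices bound to the learned class words (e.g., "cat", "dog").
    """
    a = attn.mean(dim=0)                                  # average over heads -> [pixels, tokens]
    # Per-class spatial attention maps, renormalized over pixels.
    maps = [a[:, i] / (a[:, i].sum() + 1e-8) for i in class_idx]

    # Strengthen: encourage each class map to be concentrated (low entropy).
    strengthen = sum(-(m * (m + 1e-8).log()).sum() for m in maps)

    # Suppress: penalize spatial overlap between different classes.
    overlap = sum((maps[i] * maps[j]).sum()
                  for i in range(len(maps)) for j in range(i + 1, len(maps)))

    return strengthen + suppress_weight * overlap

# Toy usage with random attention over a 64x64 latent and a 77-token prompt.
attn = torch.rand(8, 64 * 64, 77).softmax(dim=-1)
print(attention_calibration_loss(attn, class_idx=[2, 5]))
```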
https://arxiv.org/abs/2403.18551
In recent years, remarkable advancements have been achieved in the field of image generation, primarily driven by the escalating demand for high-quality outcomes across various image generation subtasks, such as inpainting, denoising, and super resolution. A major effort is devoted to exploring the application of super-resolution techniques to enhance the quality of low-resolution images. In this context, our method explores in depth the problem of ship image super resolution, which is crucial for coastal and port surveillance. We investigate the opportunity offered by the growing interest in text-to-image diffusion models, taking advantage of the prior knowledge that such foundation models have already learned. In particular, we present a diffusion-model-based architecture that leverages text conditioning during training while being class-aware, to best preserve the crucial details of the ships during the generation of the super-resolved image. Given the specificity of this task and the scarce availability of off-the-shelf data, we also introduce a large labeled ship dataset scraped from online ship images, mostly from the ShipSpotting website (this http URL). Our method achieves more robust results than other deep learning models previously employed for super resolution, as proven by the multiple experiments performed. Moreover, we investigate how this model can benefit downstream tasks, such as classification and object detection, thus emphasizing practical implementation in a real-world scenario. Experimental results show the flexibility, reliability, and impressive performance of the proposed framework over state-of-the-art methods for different tasks. The code is available at this https URL.
https://arxiv.org/abs/2403.18370
The quality of images captured outdoors is often affected by the weather. One factor that interferes with sight is rain, which can obstruct the view of observers and of computer vision applications that rely on those images. This work aims to recover rain-corrupted images by removing rain streaks via Self-supervised Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain-streak pixels in the input rain image via dictionary learning and use pixel-wise RL agents to take multiple inpainting actions to remove rain progressively. To our knowledge, this work is the first attempt to apply self-supervised RL to image deraining. Experimental results on several benchmark image-deraining datasets show that the proposed SRL-Derain performs favorably against state-of-the-art few-shot and self-supervised deraining and denoising methods.
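A toy sketch of the pixel-wise action mechanism described above: each pixel's agent selects one of a few local filtering/inpainting actions, applied over several steps. The action set and the random policy below are placeholders for illustration; SRL-Derain learns the policy with self-supervised RL and locates rain pixels via dictionary learning.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter, uniform_filter

def apply_pixelwise_actions(img, actions):
    """One deraining step: every pixel applies its own local action."""
    candidates = [
        img,                               # 0: keep the pixel unchanged
        uniform_filter(img, size=3),       # 1: box filter (mild local inpaint)
        median_filter(img, size=3),        # 2: median filter
        gaussian_filter(img, sigma=1.0),   # 3: Gaussian smoothing
    ]
    out = img.copy()
    for a, cand in enumerate(candidates):
        out[actions == a] = cand[actions == a]
    return out

rng = np.random.default_rng(0)
img = rng.random((64, 64)).astype(np.float32)
rain_mask = rng.random((64, 64)) < 0.05        # stand-in for dictionary-learned rain pixels
for _ in range(4):                             # progressive removal over multiple steps
    actions = np.where(rain_mask, rng.integers(1, 4, size=img.shape), 0)
    img = apply_pixelwise_actions(img, actions)
```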
https://arxiv.org/abs/2403.18270
We present a method for large-mask pluralistic image inpainting based on the generative framework of discrete latent codes. Our method learns latent priors, discretized as tokens, by performing computations only at the visible locations of the image. This is realized by a restrictive partial encoder that predicts the token label for each visible block, a bidirectional transformer that infers the missing labels by looking only at these tokens, and a dedicated synthesis network that couples the tokens with the partial image priors to generate coherent and pluralistic complete images even under extreme mask settings. Experiments on public benchmarks validate our design choices, as the proposed method outperforms strong baselines in both visual quality and diversity metrics.
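The iterative token-inference step can be sketched roughly as below, using a MaskGIT-style confidence-ordered fill-in assumed here for illustration; the paper's exact schedule, tokenizer, and network differ. `TinyBidirectionalTransformer`, the vocabulary size, and the token-grid size are stand-ins.

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ = 1024, 1024, 16 * 16      # token vocabulary, [MASK] id, 16x16 token grid

class TinyBidirectionalTransformer(nn.Module):
    """Stand-in for the bidirectional transformer over discrete token labels."""
    def __init__(self, dim=128, layers=2, heads=4):
        super().__init__()
        self.emb = nn.Embedding(VOCAB + 1, dim)           # +1 for the [MASK] token
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):                            # tokens: [B, SEQ] label ids
        return self.head(self.encoder(self.emb(tokens)))  # logits: [B, SEQ, VOCAB]

@torch.no_grad()
def infer_missing_tokens(model, tokens, visible, steps=4):
    """Iteratively fill masked positions, committing the most confident ones first."""
    tokens = torch.where(visible, tokens, torch.full_like(tokens, MASK_ID))
    known = visible.clone()
    for s in range(steps):
        remaining = int((~known).sum())
        if remaining == 0:
            break
        conf, pred = model(tokens).softmax(-1).max(-1)
        conf = conf.masked_fill(known, -1.0)              # rank only unknown positions
        k = -(-remaining // (steps - s))                  # ceil: finish by the last step
        idx = conf.topk(k, dim=1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
        known.scatter_(1, idx, True)
    return tokens

model = TinyBidirectionalTransformer()
tokens = torch.randint(0, VOCAB, (1, SEQ))                # labels from the partial encoder
visible = torch.rand(1, SEQ) > 0.6                        # ~40% of blocks are visible
completed = infer_missing_tokens(model, tokens, visible)  # fed to the synthesis network
```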
https://arxiv.org/abs/2403.18186
Diffusion models (DMs) are capable of generating remarkably high-quality samples by iteratively denoising a random vector, a process that corresponds to moving along the probability flow ordinary differential equation (PF ODE). Interestingly, DMs can also invert an input image to noise by moving backward along the PF ODE, a key operation for downstream tasks such as interpolation and image editing. However, the iterative nature of this process restricts its speed, hindering its broader application. Recently, Consistency Models (CMs) have emerged to address this challenge by approximating the integral of the PF ODE, thereby bypassing the need to iterate. Yet, the absence of an explicit ODE solver complicates the inversion process. To resolve this, we introduce the Bidirectional Consistency Model (BCM), which learns a single neural network that enables both forward and backward traversal along the PF ODE, efficiently unifying generation and inversion tasks within one framework. Notably, our proposed method enables one-step generation and inversion while also allowing the use of additional steps to enhance generation quality or reduce reconstruction error. Furthermore, by leveraging our model's bidirectional consistency, we introduce a sampling strategy that can enhance FID while preserving the generated image content. We further showcase our model's capabilities in several downstream tasks, such as interpolation and inpainting, and present demonstrations of potential applications, including blind restoration of compressed images and defending against black-box adversarial attacks.
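The bidirectional property can be summarized compactly as follows; the notation f_theta(x_t, t, u) is assumed here for illustration, and the paper's exact parameterization may differ:

```latex
\begin{aligned}
&f_\theta(x_t,\, t,\, u) \approx x_u
  \quad \text{for any } t, u \in [0, T] \text{ on the same PF ODE trajectory},\\
&\hat{x}_0 = f_\theta(x_T,\, T,\, 0) \;\; \text{(one-step generation)}, \qquad
 \hat{x}_T = f_\theta(x_0,\, 0,\, T) \;\; \text{(one-step inversion)},\\
&x_{t_{k+1}} = f_\theta\!\left(x_{t_k},\, t_k,\, t_{k+1}\right)
  \;\; \text{(multi-step traversal via intermediate times)}.
\end{aligned}
```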
https://arxiv.org/abs/2403.18035
We present GenesisTex, a novel method for synthesizing textures for 3D geometries from text descriptions. GenesisTex adapts a pretrained image diffusion model to texture space via texture space sampling. Specifically, we maintain a latent texture map for each viewpoint, which is updated with the noise predicted on the rendering of the corresponding viewpoint. The sampled latent texture maps are then decoded into a final texture map. During the sampling process, we focus on both global and local consistency across multiple viewpoints: global consistency is achieved through the integration of style consistency mechanisms within the noise prediction network, and low-level consistency is achieved by dynamically aligning latent textures. Finally, we apply reference-based inpainting and img2img on denser views for texture refinement. Our approach overcomes the limitations of slow optimization in distillation-based methods and instability in inpainting-based methods. Experiments on meshes from various sources demonstrate that our method surpasses the baseline methods quantitatively and qualitatively.
https://arxiv.org/abs/2403.17782
Recent studies have demonstrated that diffusion models are capable of generating high-quality samples, but their quality heavily depends on sampling guidance techniques, such as classifier guidance (CG) and classifier-free guidance (CFG). These techniques are often not applicable in unconditional generation or in various downstream tasks such as image restoration. In this paper, we propose a novel sampling guidance, called Perturbed-Attention Guidance (PAG), which improves diffusion sample quality across both unconditional and conditional settings, achieving this without requiring additional training or the integration of external modules. PAG is designed to progressively enhance the structure of samples throughout the denoising process. Exploiting the self-attention mechanism's ability to capture structural information, it generates intermediate samples with degraded structure by substituting selected self-attention maps in the diffusion U-Net with an identity matrix, and guides the denoising process away from these degraded samples. In both ADM and Stable Diffusion, PAG surprisingly improves sample quality in conditional and even unconditional scenarios. Moreover, PAG significantly improves the baseline performance in various downstream tasks where existing guidances such as CG or CFG cannot be fully utilized, including ControlNet with empty prompts and image restoration such as inpainting and deblurring.
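A minimal sketch of the two ingredients described above: identity-substituted self-attention for the degraded prediction, and a CFG-style combination that steers denoising away from it. The single-head attention, the guidance scale, and the use of attention outputs as stand-ins for the two noise predictions are simplifications to keep the sketch self-contained.

```python
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv, perturb=False):
    """Toy single-head self-attention. With perturb=True, the attention map
    softmax(QK^T / sqrt(d)) is replaced by the identity, so every token only
    attends to itself -- the structure-degrading perturbation PAG relies on."""
    q, k, v = x @ wq, x @ wk, x @ wv
    if perturb:
        return v                                          # identity attention: A = I
    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

def guided_prediction(eps, eps_perturbed, scale=3.0):
    """CFG-style update steering the denoising away from the degraded prediction."""
    return eps + scale * (eps - eps_perturbed)

# Toy usage on random token features standing in for U-Net activations.
tokens = torch.randn(1, 64, 32)
wq, wk, wv = (torch.randn(32, 32) * 0.1 for _ in range(3))
normal    = self_attention(tokens, wq, wk, wv)
perturbed = self_attention(tokens, wq, wk, wv, perturb=True)
eps_tilde = guided_prediction(normal, perturbed)          # used in place of eps_theta at each step
```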
https://arxiv.org/abs/2403.17377
Video inpainting tasks have seen significant improvements in recent years with the rise of deep neural networks and, in particular, vision transformers. Although these models show promising reconstruction quality and temporal consistency, they are still unsuitable for live videos, which remains one of the last steps toward making them completely convincing and usable. The main limitations are that these state-of-the-art models inpaint using the whole video (offline processing) and run at an insufficient frame rate. In our approach, we propose a framework that adapts existing inpainting transformers to these constraints by memorizing and refining redundant computations while maintaining a decent inpainting quality. Using this framework with some of the most recent inpainting models, we show strong online results with a consistent throughput above 20 frames per second. The code and pretrained models will be made available upon acceptance.
https://arxiv.org/abs/2403.16161
Image inpainting is the process of taking an image and regenerating lost or intentionally occluded portions. Inpainting has countless applications, including restoring previously damaged pictures, restoring the quality of images that have been degraded by compression, and removing unwanted objects or text. Modern inpainting techniques have shown a remarkable ability to generate sensible completions for images with mask occlusions. In this paper, we provide an overview of the progress of inpainting techniques, identify the current leading approaches, and examine their strengths and weaknesses. We then address a critical gap in these existing models: the ability to prompt and control exactly what is generated. We additionally justify why we think this is the natural next step that inpainting models must take, and provide multiple approaches to implementing this functionality. Finally, we evaluate the results of our approaches by qualitatively checking whether they generate high-quality images that correctly inpaint regions with the objects they are instructed to produce.
https://arxiv.org/abs/2403.16016
This paper proposes a mask optimization method for improving the quality of object removal using image inpainting. While many inpainting methods are trained with a set of random masks, a target for inpainting may be an object, such as a person, in many realistic scenarios. This domain gap between masks in training and inference images increases the difficulty of the inpainting task. In our method, this domain gap is resolved by training the inpainting network with object masks extracted by segmentation, and such object masks are also used in the inference step. Furthermore, to optimize the object masks for inpainting, the segmentation network is connected to the inpainting network and end-to-end trained to improve the inpainting performance. The effect of this end-to-end training is further enhanced by our mask expansion loss for achieving the trade-off between large and small masks. Experimental results demonstrate the effectiveness of our method for better object removal using image inpainting.
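One plausible way to picture the end-to-end objective this abstract describes is sketched below: an inpainting reconstruction term plus a mask term that must cover the object without growing unboundedly. The specific loss forms and the weight are assumptions for illustration, not the paper's exact mask expansion loss.

```python
import torch
import torch.nn.functional as F

def object_removal_loss(pred, target, soft_mask, object_gt, area_weight=0.1):
    """Illustrative end-to-end objective for segmentation + inpainting.

    pred:      inpainted image from the inpainting network.
    target:    object-free (pseudo) ground truth.
    soft_mask: differentiable mask predicted by the segmentation network.
    object_gt: binary object mask the predicted mask should cover.
    """
    inpaint = F.l1_loss(pred, target)
    coverage = F.relu(object_gt - soft_mask).mean()  # expand enough to cover the object
    area = soft_mask.mean()                          # but keep the mask from growing too large
    return inpaint + coverage + area_weight * area

pred = torch.rand(1, 3, 64, 64)
target = torch.rand(1, 3, 64, 64)
soft_mask = torch.rand(1, 1, 64, 64, requires_grad=True)
object_gt = (torch.rand(1, 1, 64, 64) > 0.8).float()
object_removal_loss(pred, target, soft_mask, object_gt).backward()
```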
https://arxiv.org/abs/2403.15849
We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantically, spatially, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc., with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher-quality edits than 6 baselines on 2 editing benchmarks across 10 evaluation metrics.
https://arxiv.org/abs/2403.14617
Monitoring diseases that affect the brain's structural integrity requires automated analysis of magnetic resonance (MR) images, e.g., for the evaluation of volumetric changes. However, many of the evaluation tools are optimized for analyzing healthy tissue. To enable the evaluation of scans containing pathological tissue, it is therefore required to restore healthy tissue in the pathological areas. In this work, we explore and extend denoising diffusion models for consistent inpainting of healthy 3D brain tissue. We modify state-of-the-art 2D, pseudo-3D, and 3D methods working in the image space, as well as 3D latent and 3D wavelet diffusion models, and train them to synthesize healthy brain tissue. Our evaluation shows that the pseudo-3D model performs best regarding the structural-similarity index, peak signal-to-noise ratio, and mean squared error. To emphasize the clinical relevance, we fine-tune this model on data containing synthetic MS lesions and evaluate it on a downstream brain tissue segmentation task, whereby it outperforms the established FMRIB Software Library (FSL) lesion-filling method.
https://arxiv.org/abs/2403.14499
Recently, how to achieve precise image editing has attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities into one framework, we adopt the concept of layers from the design domain to manipulate objects flexibly with various operations. The key insight is to transform the spatial-aware image editing task into a combination of two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion. First, we segment the latent representations of the source images into multiple layers, which include several object layers and one incomplete background layer that necessitates reliable inpainting. To avoid extra tuning, we further explore the inner inpainting ability within the self-attention mechanism. We introduce a key-masking self-attention scheme that can propagate the surrounding context information into the masked region while mitigating its impact on the regions outside the mask. Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent. We also introduce an artifact suppression scheme in the latent space to enhance the inpainting quality. Owing to the inherent modular advantages of such multi-layered representations, we can achieve accurate image editing, and we demonstrate that our approach consistently surpasses the latest spatial editing methods, including Self-Guidance and DiffEditor. Finally, we show that our approach is a unified framework that supports accurate image editing across more than six different editing tasks.
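A simplified single-head version of the key-masking self-attention idea mentioned above: keys inside the hole are masked out, so context flows into the masked region while the hole cannot contaminate tokens outside it. The shapes and weight matrices are placeholders; the paper applies this scheme inside the diffusion model's self-attention layers.

```python
import torch
import torch.nn.functional as F

def key_masked_self_attention(x, hole, wq, wk, wv):
    """Single-head self-attention with keys inside the hole masked out.

    x:    [B, N, C] token features (N = H*W spatial tokens).
    hole: [B, N] bool, True where the region needs inpainting.
    Queries everywhere attend only to tokens outside the hole, so surrounding
    context propagates into the masked region without the reverse happening.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5     # [B, N, N]
    scores = scores.masked_fill(hole[:, None, :], float("-inf"))
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(1, 16 * 16, 64)
hole = torch.zeros(1, 16 * 16, dtype=torch.bool)
hole[:, 40:80] = True                                         # toy hole region
wq, wk, wv = (torch.randn(64, 64) * 0.1 for _ in range(3))
out = key_masked_self_attention(x, hole, wq, wk, wv)
```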
https://arxiv.org/abs/2403.14487
Inpainting, the task of filling in missing image regions, is crucial in various applications, such as medical imaging and remote sensing. The efficiency of trending data-driven approaches to image inpainting often comes at the cost of extensive data preprocessing. In this sense, there is still a need for model-driven approaches when applications are constrained by data availability and quality, especially those related to time series forecasting using image inpainting techniques. This paper proposes an improved model-driven approach relying on patch-based techniques. Our approach deviates from the standard Sum of Squared Differences (SSD) similarity measure by introducing a Hybrid Similarity (HySim), which combines the strengths of the Chebyshev and Minkowski distances. This hybridization enhances patch selection, leading to high-quality inpainting results with reduced mismatch errors. Experimental results demonstrate the effectiveness of our approach against other model-driven techniques, such as diffusion-based or patch-based approaches, showcasing its ability to achieve visually pleasing restorations.
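A tiny numpy sketch of what a hybrid Chebyshev/Minkowski patch similarity could look like; the convex combination with weight `alpha` and the Minkowski order `p` are illustrative assumptions, not HySim's exact blending rule.

```python
import numpy as np

def hybrid_distance(patch_a, patch_b, p=3, alpha=0.5):
    """Blend of Minkowski (order p) and Chebyshev distances between two patches."""
    diff = np.abs(patch_a - patch_b)
    minkowski = (diff ** p).sum() ** (1.0 / p)
    chebyshev = diff.max()
    return alpha * minkowski + (1.0 - alpha) * chebyshev

def best_matching_patch(target, candidates, **kwargs):
    """Exemplar selection: pick the candidate patch with the smallest hybrid distance."""
    return int(np.argmin([hybrid_distance(target, c, **kwargs) for c in candidates]))

rng = np.random.default_rng(0)
target = rng.random((9, 9, 3))                  # patch centered on the fill front
candidates = [rng.random((9, 9, 3)) for _ in range(50)]
idx = best_matching_patch(target, candidates)
```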
https://arxiv.org/abs/2403.14292
Data generated in clinical practice often exhibits biases, such as long-tail imbalance and algorithmic unfairness. This study aims to mitigate these challenges through data synthesis. Previous efforts in medical imaging synthesis have struggled with separating lesion information from background context, leading to difficulties in generating high-quality backgrounds and limited control over the synthetic output. Inspired by diffusion-based image inpainting, we propose LeFusion, lesion-focused diffusion models. By redesigning the diffusion learning objectives to concentrate on lesion areas, it simplifies the model learning process and enhances the controllability of the synthetic output, while preserving the background by integrating forward-diffused background contexts into the reverse diffusion process. We further generalize it to jointly handle multi-class lesions and introduce a generative model for lesion masks to increase synthesis diversity. Validated on the DE-MRI cardiac lesion segmentation dataset (Emidec), our methodology employs the popular nnUNet to demonstrate that the synthetic data make it possible to effectively enhance a state-of-the-art model. Code and model are available at this https URL.
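The background-preserving mechanism (integrating forward-diffused background context into each reverse step) can be sketched as below. The noise schedule, the placeholder `denoise_step`, and the compositing form are illustrative stand-ins; the paper additionally redesigns the training objective to focus on the lesion region.

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    """Forward-diffuse a clean image to noise level t."""
    a = alphas_bar[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * torch.randn_like(x0)

def denoise_step(x_t, t):
    """Placeholder for the trained reverse-diffusion step of the lesion model."""
    return 0.99 * x_t

def lesion_focused_step(x_t, t, background, lesion_mask):
    """Keep the lesion region from the model while re-injecting the noised real background."""
    x_prev_gen = denoise_step(x_t, t)                # model's reverse step (lesion content)
    x_prev_bg = q_sample(background, max(t - 1, 0))  # forward-diffused known background
    return lesion_mask * x_prev_gen + (1.0 - lesion_mask) * x_prev_bg

x = torch.randn(1, 1, 64, 64)
background = torch.rand(1, 1, 64, 64)
lesion_mask = torch.zeros_like(background)
lesion_mask[:, :, 24:40, 24:40] = 1.0                # region where a lesion is synthesized
for t in range(999, 0, -1):
    x = lesion_focused_step(x, t, background, lesion_mask)
```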
https://arxiv.org/abs/2403.14066
Implicit neural representations (INR) excel at encoding videos within neural networks, showing promise in computer vision tasks like video compression and denoising. INR-based approaches reconstruct video frames from content-agnostic embeddings, which hampers their efficacy in video frame regression and restricts their generalization ability for video interpolation. To address these deficiencies, Hybrid Neural Representation for Videos (HNeRV) was introduced with content-adaptive embeddings. Nevertheless, HNeRV's compression ratios remain relatively low, attributable to an oversight in leveraging the network's shallow features and inter-frame residual information. In this work, we introduce an advanced U-shaped architecture, Vector Quantized-NeRV (VQ-NeRV), which integrates a novel component, the VQ-NeRV Block. This block incorporates a codebook mechanism to discretize the network's shallow residual features and inter-frame residual information effectively. This approach proves particularly advantageous in video compression, as it results in a smaller size compared to quantized features. Furthermore, we introduce an original codebook optimization technique, termed shallow codebook optimization, designed to refine the utility and efficiency of the codebook. The experimental evaluations indicate that VQ-NeRV outperforms HNeRV on video regression tasks, delivering superior reconstruction quality (with an increase of 1-2 dB in Peak Signal-to-Noise Ratio (PSNR)), better bits-per-pixel (bpp) efficiency, and improved video inpainting outcomes.
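The codebook mechanism mentioned above is essentially vector quantization of feature vectors; a minimal sketch with a straight-through gradient estimator is below. The codebook size, dimensionality, and training loss terms (commitment and codebook losses are omitted) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Codebook(nn.Module):
    """Minimal vector-quantization block: nearest-codeword lookup with a
    straight-through gradient, as applied to shallow residual features."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, z):                                             # z: [B, N, dim]
        dists = (z.unsqueeze(-2) - self.codes.weight).pow(2).sum(-1)  # [B, N, num_codes]
        idx = dists.argmin(dim=-1)                                    # nearest codeword per feature
        z_q = self.codes(idx)                                         # quantized features
        z_q = z + (z_q - z).detach()                                  # straight-through estimator
        return z_q, idx                                               # idx is what gets stored/entropy-coded

vq = Codebook()
feats = torch.randn(2, 256, 64, requires_grad=True)                  # stand-in shallow residual features
quantized, indices = vq(feats)
quantized.sum().backward()                                            # gradients still reach `feats`
```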
https://arxiv.org/abs/2403.12401
Face inpainting, the technique of restoring missing or damaged regions in facial images, is pivotal for applications like face recognition in occluded scenarios and image analysis with poor-quality captures. This process not only needs to produce realistic visuals but also to preserve individual identity characteristics. The aim of this paper is to inpaint a face given its periocular region (eyes-to-face) through a proposed new Generative Adversarial Network (GAN)-based model called the Eyes-to-Face Network (E2F-Net). The proposed approach extracts identity and non-identity features from the periocular region using two dedicated encoders. The extracted features are then mapped to the latent space of a pre-trained StyleGAN generator to benefit from its state-of-the-art performance and its rich, diverse, and expressive latent space without any additional training. We further improve the StyleGAN output by finding the optimal code in the latent space using a new optimization technique for GAN inversion. Our E2F-Net requires a minimal training process, reducing the computational complexity as a secondary benefit. Through extensive experiments, we show that our method successfully reconstructs the whole face with high quality, surpassing current techniques despite requiring significantly less training and supervision effort. We have generated seven eyes-to-face datasets based on well-known public face datasets for training and verifying our proposed method. The code and datasets are publicly available.
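A rough sketch of the latent-optimization stage described above: starting from the code predicted by the encoders, refine it so a frozen generator reproduces the target face. The tiny linear "generator", the pixel-only loss, the step count, and the learning rate are placeholders; the actual method uses a pretrained StyleGAN and richer identity and perceptual objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen stand-in for the pretrained StyleGAN generator.
generator = nn.Sequential(nn.Linear(512, 3 * 64 * 64), nn.Tanh())
for p in generator.parameters():
    p.requires_grad_(False)

def invert_to_latent(target, w_init, steps=200, lr=0.05):
    """Refine the encoder-predicted latent code so the generator matches the target face."""
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        img = generator(w).view(-1, 3, 64, 64)
        loss = F.mse_loss(img, target)       # the paper adds identity / perceptual terms
        loss.backward()
        opt.step()
    return w.detach()

target = torch.rand(1, 3, 64, 64) * 2 - 1    # toy target face in [-1, 1]
w0 = torch.randn(1, 512)                     # stand-in for the encoders' predicted code
w_star = invert_to_latent(target, w0)
```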
https://arxiv.org/abs/2403.12197
Recent advancements in video generation have been remarkable, yet many existing methods struggle with issues of consistency and poor text-video alignment. Moreover, the field lacks effective techniques for text-guided video inpainting, a stark contrast to the well-explored domain of text-guided image inpainting. To this end, this paper proposes a novel text-guided video inpainting model that achieves better consistency, controllability, and compatibility. Specifically, we introduce a simple but efficient motion capture module to preserve motion consistency, design an instance-aware region selection instead of a random region selection to obtain better textual controllability, and utilize a novel strategy to inject personalized models into our CoCoCo model to obtain better model compatibility. Extensive experiments show that our model can generate high-quality video clips. Meanwhile, our model shows better motion consistency, textual controllability, and model compatibility. More details are shown at this http URL.
https://arxiv.org/abs/2403.12035
Diffusion models are the main driver of progress in image and video synthesis, but they suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD), aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to its reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution multi-aspect-ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD's effectiveness in various applications such as image editing and inpainting.
https://arxiv.org/abs/2403.12015
Text-to-texture synthesis has become a new frontier in 3D content creation thanks to the recent advances in text-to-image models. Existing methods primarily adopt a combination of pretrained depth-aware diffusion and inpainting models, yet they exhibit shortcomings such as 3D inconsistency and limited controllability. To address these challenges, we introduce InteX, a novel framework for interactive text-to-texture synthesis. 1) InteX includes a user-friendly interface that facilitates interaction and control throughout the synthesis process, enabling region-specific repainting and precise texture editing. 2) Additionally, we develop a unified depth-aware inpainting model that integrates depth information with inpainting cues, effectively mitigating 3D inconsistencies and improving generation speed. Through extensive experiments, our framework has proven to be both practical and effective in text-to-texture synthesis, paving the way for high-quality 3D content creation.
https://arxiv.org/abs/2403.11878