The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite the proliferation of 3D generation algorithms, they frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While some studies have addressed some of these issues, a comprehensive solution remains elusive. In this paper, we introduce \textbf{CaPa}, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process, decoupling geometry generation from texture synthesis. Initially, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Subsequently, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for a given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, resulting in cohesive results across the entire model. This pipeline generates high-quality 3D assets in less than 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results demonstrate that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.
https://arxiv.org/abs/2501.09433
We propose a new continuous video modeling framework based on implicit neural representations (INRs) called ActINR. At the core of our approach is the observation that INRs can be considered as a learnable dictionary, with the shapes of the basis functions governed by the weights of the INR, and their locations governed by the biases. Given compact non-linear activation functions, we hypothesize that an INR's biases are suitable to capture motion across images, and facilitate compact representations for video sequences. Using these observations, we design ActINR to share INR weights across frames of a video sequence, while using unique biases for each frame. We further model the biases as the output of a separate INR conditioned on time index to promote smoothness. By training the video INR and this bias INR together, we demonstrate unique capabilities, including $10\times$ video slow motion, $4\times$ spatial super resolution along with $2\times$ slow motion, denoising, and video inpainting. ActINR performs remarkably well across numerous video processing tasks (often achieving more than 6dB improvement), setting a new standard for continuous modeling of videos.
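To make the weight-sharing idea concrete, here is a minimal, hedged sketch in PyTorch of a video INR whose weights are shared across frames while all biases are produced by a small bias INR conditioned on the time index; the layer sizes, SIREN-style sine nonlinearity, and bias-network design are illustrative assumptions, not the authors' exact architecture.

```python
# Illustrative sketch only: shared INR weights across frames, per-frame biases
# predicted by a small "bias INR" conditioned on the time index.
import torch
import torch.nn as nn

class SharedWeightVideoINR(nn.Module):
    def __init__(self, hidden=256, layers=3, w0=30.0):
        super().__init__()
        dims = [2] + [hidden] * layers + [3]              # (x, y) -> (r, g, b)
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(o, i) / i ** 0.5) for i, o in zip(dims[:-1], dims[1:])]
        )
        self.splits = dims[1:]                            # bias sizes per layer
        self.bias_inr = nn.Sequential(                    # time index -> all biases
            nn.Linear(1, 64), nn.SiLU(), nn.Linear(64, sum(self.splits))
        )
        self.w0 = w0

    def forward(self, xy, t):
        # xy: (N, 2) coordinates in [-1, 1]; t: scalar tensor, normalized frame time.
        biases = self.bias_inr(t.view(1, 1)).split(self.splits, dim=-1)
        h = xy
        for k, W in enumerate(self.weights):
            h = h @ W.t() + biases[k]
            if k < len(self.weights) - 1:
                h = torch.sin(self.w0 * h)                # compact nonlinearity
        return h                                          # predicted RGB
```

Training would regress the output against the pixel colors of frame `t`; querying intermediate values of `t` is what a continuous representation like this exploits for the slow-motion results described above.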
https://arxiv.org/abs/2501.09277
State-of-the-art supervised stereo matching methods have achieved amazing results on various benchmarks. However, these data-driven methods generalize poorly to real-world scenarios due to the lack of real-world annotated data. In this paper, we propose StereoGen, a novel pipeline for high-quality stereo image generation. This pipeline utilizes arbitrary single images as left images and pseudo disparities generated by a monocular depth estimation model to synthesize high-quality corresponding right images. Unlike previous methods that fill the occluded areas of the warped right images with random backgrounds or selectively aggregate nearby pixels via convolutions, we fine-tune a diffusion inpainting model to recover the background. Images generated by our model possess better details and undamaged semantic structures. Besides, we propose Training-free Confidence Generation and Adaptive Disparity Selection. The former suppresses the negative effect of harmful pseudo ground truth during stereo training, while the latter helps generate a wider disparity distribution and better synthetic images. Experiments show that models trained under our pipeline achieve state-of-the-art zero-shot generalization results among all published methods. The code will be available upon publication of the paper.
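For intuition, a minimal sketch of the warping step such a pipeline relies on: forward-warp the left image with the pseudo disparity, resolve collisions by keeping the nearer (larger-disparity) pixel, and record the holes that the diffusion inpainting model would then fill. Function and variable names are illustrative, not taken from StereoGen.

```python
# Illustrative sketch: synthesize a right view by forward-warping the left image
# with a pseudo-disparity map; occluded/empty pixels are returned as a hole mask.
import numpy as np

def warp_left_to_right(left, disparity):
    """left: (H, W, 3) image; disparity: (H, W) disparities in pixels."""
    H, W, _ = left.shape
    right = np.zeros_like(left)
    zbuf = np.full((H, W), -np.inf)           # keep the closest (largest-disparity) source
    hole = np.ones((H, W), dtype=bool)        # True where no source pixel lands
    for y in range(H):
        for x in range(W):
            tx = int(round(x - disparity[y, x]))
            if 0 <= tx < W and disparity[y, x] > zbuf[y, tx]:
                zbuf[y, tx] = disparity[y, x]
                right[y, tx] = left[y, x]
                hole[y, tx] = False
    return right, hole                        # `hole` marks regions left for inpainting
```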
https://arxiv.org/abs/2501.08654
Generative AI presents transformative potential across various domains, from creative arts to scientific visualization. However, the utility of AI-generated imagery is often compromised by visual flaws, including anatomical inaccuracies, improper object placements, and misplaced textual elements. These imperfections pose significant challenges for practical applications. To overcome these limitations, we introduce \textit{Yuan}, a novel framework that autonomously corrects visual imperfections in text-to-image synthesis. \textit{Yuan} uniquely conditions on both the textual prompt and the segmented image, generating precise masks that identify areas in need of refinement without requiring manual intervention -- a common constraint in previous methodologies. Following the automated masking process, an advanced inpainting module seamlessly integrates contextually coherent content into the identified regions, preserving the integrity and fidelity of the original image and associated text prompts. Through extensive experimentation on publicly available datasets such as ImageNet100 and Stanford Dogs, along with a custom-generated dataset, \textit{Yuan} demonstrated superior performance in eliminating visual imperfections. Our approach consistently achieved higher scores in quantitative metrics, including NIQE, BRISQUE, and PI, alongside favorable qualitative evaluations. These results underscore \textit{Yuan}'s potential to significantly enhance the quality and applicability of AI-generated images across diverse fields.
https://arxiv.org/abs/2501.08505
Hyperspectral images are typically composed of hundreds of narrow and contiguous spectral bands, each containing information regarding the material composition of the imaged scene. However, these images can be affected by various sources of noise, distortions, or data loss, which can significantly degrade their quality and usefulness. This paper introduces an algorithm with a convergence guarantee, LRS-PnP-DIP(1-Lip), which successfully addresses the previously reported instability issue of DHP. The proposed algorithm extends the successful joint low-rank and sparse model to further exploit the underlying data structures beyond the conventional, and sometimes restrictive, union-of-subspaces models. A stability analysis guarantees the convergence of the proposed algorithm under mild assumptions, which is crucial for its application in real-world scenarios. Extensive experiments demonstrate that the proposed solution consistently delivers visually and quantitatively superior inpainting results, establishing state-of-the-art performance.
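As background for the joint low-rank and sparse modeling, the following toy sketch shows only the low-rank ingredient of such hyperspectral inpainting schemes (singular-value thresholding on the band-unfolded cube alternated with data consistency); it is not the LRS-PnP-DIP(1-Lip) algorithm and omits the sparse, plug-and-play, and DIP components.

```python
# Toy illustration of low-rank hyperspectral inpainting: singular-value
# thresholding on the (pixels x bands) unfolding, alternated with data consistency.
import numpy as np

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def lowrank_inpaint(Y, mask, tau=1.0, n_iter=100):
    """Y: (H, W, B) observed cube; mask: (H, W, B) bool, True where observed."""
    H, W, B = Y.shape
    X = Y.astype(float).copy()
    for _ in range(n_iter):
        X = svt(X.reshape(H * W, B), tau).reshape(H, W, B)  # low-rank proximal step
        X[mask] = Y[mask]                                   # keep observed entries
    return X
```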
https://arxiv.org/abs/2501.08195
Modern machine learning techniques have shown tremendous potential, especially for object detection on camera images. For this reason, they are also used to enable safety-critical automated processes such as autonomous drone flights. We present a study on object detection for Detect and Avoid, a safety-critical function for drones that detects air traffic during automated flights for safety reasons. An ill-posed problem is the generation of good and, in particular, large data sets, since the detections themselves are the corner cases. Most models suffer from limited ground truth in the raw data, \eg recorded air traffic or frontal flights with a small aircraft, which often leads to critically poor detection rates. We overcome this problem by using inpainting methods to bootstrap the dataset such that it explicitly contains the corner cases of the raw data. We provide an overview of inpainting methods and generative models and present an example pipeline given a small annotated dataset. We validate our method by generating a high-resolution dataset, which we make publicly available, and by evaluating it with an independent object detector that was trained entirely on real data.
https://arxiv.org/abs/2501.08142
The rapid advancements in generative models, particularly diffusion-based techniques, have revolutionized image inpainting tasks by enabling the generation of high-fidelity and diverse content. However, object removal remains under-explored as a specific subset of inpainting, facing challenges such as inadequate semantic understanding and the unintended generation of artifacts. Existing datasets for object removal often rely on synthetic data, which fails to align with real-world scenarios, limiting model performance. Although some real-world datasets address these issues partially, they suffer from limited scalability, annotation inefficiencies, and limited realism in physical phenomena such as lighting and shadows. To address these limitations, this paper introduces a novel approach to object removal by constructing a high-resolution real-world dataset through long-duration video capture with fixed camera settings. Leveraging advanced tools such as Grounding-DINO, Segment-Anything-Model, and MASA for automated annotation, we provide paired images, backgrounds, and masks while significantly reducing annotation time and labor. With our efficient annotation pipeline, we release the first fully open, high-resolution real-world dataset for object removal, and we demonstrate improved performance on object removal tasks through fine-tuning of pre-trained diffusion models.
https://arxiv.org/abs/2501.07397
Safety-critical applications, such as autonomous driving, require extensive multimodal data for rigorous testing. Methods based on synthetic data are gaining prominence due to the cost and complexity of gathering real-world data but require a high degree of realism and controllability in order to be useful. This paper introduces MObI, a novel framework for Multimodal Object Inpainting that leverages a diffusion model to create realistic and controllable object inpaintings across perceptual modalities, demonstrated for both camera and lidar simultaneously. Using a single reference RGB image, MObI enables objects to be seamlessly inserted into existing multimodal scenes at a 3D location specified by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, our 3D bounding box conditioning gives objects accurate spatial positioning and realistic scaling. As a result, our approach can be used to insert novel objects flexibly into multimodal scenes, providing significant advantages for testing perception models.
https://arxiv.org/abs/2501.03173
As artificial intelligence advances rapidly, particularly with the advent of GANs and diffusion models, Image Inpainting Localization (IIL) has become increasingly challenging. Current IIL methods face two main challenges: a tendency towards overconfidence, leading to incorrect predictions; and difficulty in detecting subtle tampering boundaries in inpainted images. In response, we propose a new paradigm that treats IIL as a conditional mask generation task utilizing diffusion models. Our method, InpDiffusion, utilizes a denoising process enhanced by the integration of image semantic conditions to progressively refine predictions. During denoising, we employ edge conditions and introduce a novel edge supervision strategy to enhance the model's perception of edge details in inpainted objects. Balancing the diffusion model's stochastic sampling with edge supervision of tampered image regions mitigates the risk of incorrect predictions from overconfidence and prevents the loss of subtle boundaries that can result from overly stochastic processes. Furthermore, we propose an innovative Dual-stream Multi-scale Feature Extractor (DMFE) for extracting multi-scale features, enhancing feature representation by considering both the semantic and edge conditions of the inpainted images. Extensive experiments across challenging datasets demonstrate that InpDiffusion significantly outperforms existing state-of-the-art methods in IIL tasks, while also showcasing excellent generalization capabilities and robustness.
https://arxiv.org/abs/2501.02816
We report ACE++, an instruction-based diffusion framework that tackles various image generation and editing tasks. Inspired by the input format for the inpainting task proposed by FLUX.1-Fill-dev, we improve the Long-context Condition Unit (LCU) introduced in ACE and extend this input paradigm to arbitrary editing and generation tasks. To take full advantage of image generative priors, we develop a two-stage training scheme to minimize the effort of finetuning powerful text-to-image diffusion models like FLUX.1-dev. In the first stage, we pre-train the model using task data for the 0-ref tasks of the text-to-image model. Many community models obtained by post-training text-to-image foundation models already fit this first-stage training paradigm; for example, FLUX.1-Fill-dev deals primarily with inpainting tasks and can be used as an initialization to accelerate training. In the second stage, we finetune the above model to support general instructions using all tasks defined in ACE. To promote the widespread application of ACE++ in different scenarios, we provide a comprehensive set of models that cover both full finetuning and lightweight finetuning, while considering both general applicability and applicability to vertical scenarios. The qualitative analysis showcases the superiority of ACE++ in terms of image generation quality and prompt-following ability.
https://arxiv.org/abs/2501.02487
In the task of reference-based image inpainting, an additional reference image is provided to restore a damaged target image to its original state. The advancement of diffusion models, particularly Stable Diffusion, allows for simple formulations in this task. However, existing diffusion-based methods often lack explicit constraints on the correlation between the reference and damaged images, resulting in lower faithfulness to the reference images in the inpainting results. In this work, we propose CorrFill, a training-free module designed to enhance the awareness of geometric correlations between the reference and target images. This enhancement is achieved by guiding the inpainting process with correspondence constraints estimated during inpainting, utilizing attention masking in self-attention layers and an objective function to update the input tensor according to the constraints. Experimental results demonstrate that CorrFill significantly enhances the performance of multiple baseline diffusion-based methods, including state-of-the-art approaches, by emphasizing faithfulness to the reference images.
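To illustrate the flavor of attention masking under correspondence constraints, here is a minimal sketch; the tensor layout, window rule, and names below are assumptions for exposition, not CorrFill's exact formulation.

```python
# Illustrative sketch: restrict attention from damaged-region queries to a window
# around their estimated corresponding reference tokens.
import torch
import torch.nn.functional as F

def corr_masked_attention(q, k, v, corr, window=4):
    """
    q: (N_tgt, d) queries from damaged-region tokens.
    k, v: (N_ref, d) keys/values from reference-image tokens.
    corr: (N_tgt,) long tensor, estimated reference-token index for each query.
    """
    N_ref = k.shape[0]
    idx = torch.arange(N_ref).unsqueeze(0)                  # (1, N_ref)
    allowed = (idx - corr.unsqueeze(1)).abs() <= window     # (N_tgt, N_ref)
    attn = (q @ k.t()) / q.shape[-1] ** 0.5
    attn = attn.masked_fill(~allowed, float("-inf"))
    return F.softmax(attn, dim=-1) @ v
```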
https://arxiv.org/abs/2501.02355
Face Restoration (FR) is a crucial area within image and video processing, focusing on reconstructing high-quality portraits from degraded inputs. Despite advancements in image FR, video FR remains relatively under-explored, primarily due to challenges related to temporal consistency, motion artifacts, and the limited availability of high-quality video data. Moreover, traditional face restoration typically prioritizes enhancing resolution and gives less consideration to related tasks such as facial colorization and inpainting. In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video blind face restoration (BFR), inpainting, and colorization, tasks that we empirically show to benefit each other. We present a unified framework, termed Stable Video Face Restoration (SVFR), which leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding is introduced to enhance task identification. Meanwhile, a novel Unified Latent Regularization (ULR) is employed to encourage shared feature representation learning among the different subtasks. To further enhance restoration quality and temporal stability, we introduce facial prior learning and self-referred refinement as auxiliary strategies used for both training and inference. The proposed framework effectively combines the complementary strengths of these tasks, enhancing temporal coherence and achieving superior restoration quality. This work advances the state of the art in video FR and establishes a new paradigm for generalized video face restoration.
https://arxiv.org/abs/2501.01235
Recent advancements in diffusion models have significantly advanced text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-Level controlled Image Generation. We introduce regional attention, a mechanism for diffusion transformers that requires no additional parameters, seamlessly integrating entity prompts and arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both positional control precision and image quality. Additionally, we propose an inpainting fusion pipeline, extending EliGen to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with community models such as IP-Adapter and MLLM, unlocking new creative possibilities. The source code, dataset, and model will be released publicly.
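As a rough illustration of how a parameter-free regional attention can be realized (the token layout and masking convention below are assumptions, not EliGen's exact design), one can build a boolean attention mask that exposes each entity's prompt tokens only to the image tokens inside that entity's spatial mask:

```python
# Illustrative sketch: build a regional attention mask routing image tokens to the
# global prompt plus only their own entity's prompt tokens.
import torch

def regional_attention_mask(entity_masks, entity_token_spans, n_img, n_txt):
    """
    entity_masks: (E, n_img) bool, spatial mask of each entity over image tokens.
    entity_token_spans: list of (start, end) text-token spans, one per entity.
    Returns an (n_img, n_txt) bool mask; True means attention is allowed.
    """
    allowed = torch.ones(n_img, n_txt, dtype=torch.bool)    # global prompt visible everywhere
    for e, (s, t) in enumerate(entity_token_spans):
        allowed[:, s:t] = entity_masks[e].unsqueeze(1)      # entity prompt only inside its mask
    return allowed
```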
https://arxiv.org/abs/2501.01097
In the field of video compression, the pursuit of better quality at lower bit rates remains a long-standing goal. Recent developments have demonstrated the potential of Implicit Neural Representation (INR) as a promising alternative to traditional transform-based methodologies. Video INRs can be roughly divided into frame-wise and pixel-wise methods according to the structure of the network's output. While pixel-based methods are better suited for upsampling and parallelization, frame-wise methods have demonstrated better performance. We introduce CoordFlow, a novel pixel-wise INR for video compression. It yields state-of-the-art results compared to other pixel-wise INRs and performance on par with leading frame-wise techniques. The method is based on the separation of the visual information into visually consistent layers, each represented by a dedicated network that compensates for the layer's motion. A byproduct of this integration is an unsupervised segmentation of the video sequence. Object motion trajectories are implicitly utilized to compensate for visual-temporal redundancies. Additionally, the proposed method provides inherent video upsampling, stabilization, inpainting, and denoising capabilities.
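A minimal sketch of the layered, motion-compensated pixel-wise INR idea follows; the plain two-layer MLPs and soft alpha compositing are illustrative choices, not CoordFlow's actual architecture.

```python
# Illustrative sketch: each layer warps (x, y) by a time-conditioned flow, queries
# a per-layer color network, and the layers are softly composited. The soft
# assignment weights double as a crude per-pixel layer segmentation.
import torch
import torch.nn as nn

class FlowLayer(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.flow = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.color = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 4))

    def forward(self, xy, t):
        xyt = torch.cat([xy, t.view(1, 1).expand(xy.shape[0], 1)], dim=-1)
        canonical = xy + self.flow(xyt)                      # motion-compensated coordinates
        rgba = self.color(canonical)
        return torch.sigmoid(rgba[:, :3]), rgba[:, 3:]       # per-layer color and logit

class CoordFlowSketch(nn.Module):
    def __init__(self, n_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([FlowLayer() for _ in range(n_layers)])

    def forward(self, xy, t):
        colors, logits = zip(*[layer(xy, t) for layer in self.layers])
        w = torch.softmax(torch.cat(logits, dim=-1), dim=-1)  # soft layer assignment
        return sum(w[:, i:i + 1] * colors[i] for i in range(len(colors)))
```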
https://arxiv.org/abs/2501.00975
In recent years, it has become popular to tackle image restoration tasks with a single pretrained diffusion model (DM) and data-fidelity guidance, instead of training a dedicated deep neural network per task. However, such "zero-shot" restoration schemes currently require many Neural Function Evaluations (NFEs) to perform well, which may be attributed to the many NFEs needed in the original generative functionality of the DMs. Recently, faster variants of DMs have been explored for image generation. These include Consistency Models (CMs), which can generate samples via a couple of NFEs. However, existing works that use guided CMs for restoration still require tens of NFEs or per-task fine-tuning of the model, which leads to a performance drop if the assumptions made during fine-tuning are not accurate. In this paper, we propose a zero-shot restoration scheme that uses CMs and operates well with as few as 4 NFEs. It is based on a wise combination of several ingredients: better initialization, back-projection guidance, and, above all, a novel noise injection mechanism. We demonstrate the advantages of our approach for image super-resolution, deblurring, and inpainting. Interestingly, we show that the usefulness of our noise injection technique goes beyond CMs: it can also mitigate the performance degradation of existing guided DM methods when their NFE count is reduced.
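For reference, here is what a generic back-projection (data-consistency) step looks like when the degradation is an inpainting mask, together with a hedged stand-in for noise re-injection; this only sketches the two named ingredients under assumed notation and is not the paper's 4-NFE algorithm.

```python
# Illustrative sketch of back-projection guidance for inpainting plus a generic
# noise re-injection step; names and the masking convention are assumptions.
import torch

def back_projection_inpainting(x0_hat, y, mask):
    """
    x0_hat: (C, H, W) current clean-image estimate from the consistency model.
    y:      (C, H, W) observed image, valid only where mask == 1.
    mask:   (1, H, W) binary mask of observed pixels.
    For a masking operator A, the back-projection A^T (A A^T)^{-1} (y - A x)
    reduces to copying the observed pixels into the estimate.
    """
    return mask * y + (1.0 - mask) * x0_hat

def reinject_noise(x0_hat, sigma):
    # Perturb the corrected estimate back to noise level sigma before the next
    # consistency-model evaluation (a stand-in for the paper's mechanism).
    return x0_hat + sigma * torch.randn_like(x0_hat)
```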
https://arxiv.org/abs/2412.20596
Multimodal learning has been demonstrated to enhance performance across various clinical tasks, owing to the diverse perspectives offered by different modalities of data. However, existing multimodal segmentation methods rely on well-registered multimodal data, which is unrealistic for real-world clinical images, particularly for indistinct and diffuse regions such as liver tumors. In this paper, we introduce Diff4MMLiTS, a four-stage multimodal liver tumor segmentation pipeline: pre-registration of the target organs in multimodal CTs; dilation of the annotated modality's mask, followed by its use in inpainting to obtain multimodal normal CTs without tumors; synthesis of strictly aligned multimodal CTs with tumors using a latent diffusion model based on multimodal CT features and randomly generated tumor masks; and finally, training of the segmentation model, thus eliminating the need for strictly aligned multimodal data. Extensive experiments on public and internal datasets demonstrate the superiority of Diff4MMLiTS over other state-of-the-art multimodal segmentation methods.
https://arxiv.org/abs/2412.20418
Inverse problems exist in many disciplines of science and engineering. In computer vision, for example, tasks such as inpainting, deblurring, and super-resolution can be effectively modeled as inverse problems. Recently, denoising diffusion probabilistic models (DDPMs) have been shown to provide a promising solution to noisy linear inverse problems without the need for additional task-specific training. Specifically, with the prior provided by DDPMs, one can sample from the posterior by approximating the likelihood. In the literature, approximations of the likelihood are often based on the mean of the conditional densities of the reverse process, which can be obtained using Tweedie's formula. To obtain a better approximation to the likelihood, in this paper we first derive a closed-form formula for the covariance of the reverse process. Then, we propose a finite-difference-based method to approximate this covariance so that it can be readily obtained from existing pretrained DDPMs, thereby not increasing the complexity compared to existing approaches. Finally, based on the mean and the approximated covariance of the reverse process, we present a new approximation to the likelihood. We refer to this method as covariance-aware diffusion posterior sampling (CA-DPS). Experimental results show that CA-DPS significantly improves reconstruction performance without requiring hyperparameter tuning. The code for the paper is provided in the supplementary materials.
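For context, under the usual forward model $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ (notation assumed here, not taken from the paper), the standard first- and second-order Tweedie identities give the posterior mean and covariance in terms of the score, and the Hessian term can be approximated with finite differences of a pretrained score network $s_\theta \approx \nabla \log p_t$:
\[
\mathbb{E}[x_0 \mid x_t] = \frac{x_t + (1-\bar\alpha_t)\,\nabla_{x_t}\log p_t(x_t)}{\sqrt{\bar\alpha_t}},
\qquad
\mathrm{Cov}[x_0 \mid x_t] = \frac{1-\bar\alpha_t}{\bar\alpha_t}\Big(I + (1-\bar\alpha_t)\,\nabla^2_{x_t}\log p_t(x_t)\Big),
\]
\[
\nabla^2_{x_t}\log p_t(x_t)\,v \;\approx\; \frac{s_\theta(x_t + \delta v,\, t) - s_\theta(x_t,\, t)}{\delta}
\quad \text{for a small step } \delta.
\]
These are generic identities rather than the paper's specific derivation, but they indicate how a covariance estimate can be read off an existing pretrained score model without extra training.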
https://arxiv.org/abs/2412.20045
Recent research in subject-driven generation increasingly emphasizes the importance of selective subject features. Nevertheless, accurately selecting the content in a given reference image still poses challenges, especially when selecting among similar subjects in an image (e.g., two different dogs). Some methods attempt to use text prompts or pixel masks to isolate specific elements. However, text prompts often fall short in precisely describing specific content, and pixel masks are often expensive to obtain. To address this, we introduce P3S-Diffusion, a novel architecture designed for context-selected subject-driven generation via point supervision. P3S-Diffusion leverages minimal-cost labels (e.g., points) to generate subject-driven images. During fine-tuning, it can generate an expanded base mask from these points, obviating the need for additional segmentation models. The mask is employed for inpainting and for alignment with the subject representation. P3S-Diffusion preserves fine features of the subjects through Multi-layers Condition Injection. Enhanced by an Attention Consistency Loss for improved training, it demonstrates excellent feature preservation and image generation capabilities in extensive experiments.
https://arxiv.org/abs/2412.19533
Photo-realistic scene reconstruction from sparse-view, uncalibrated images is highly desirable in practice. Although some successes have been made, existing methods are either sparse-view but require accurate camera parameters (i.e., intrinsics and extrinsics), or SfM-free but need densely captured images. To combine the advantages of both types of methods while addressing their respective weaknesses, we propose Dust to Tower (D2T), an accurate and efficient coarse-to-fine framework to optimize 3DGS and image poses simultaneously from sparse and uncalibrated images. Our key idea is to first construct a coarse model efficiently and subsequently refine it using warped and inpainted images at novel viewpoints. To do this, we first introduce a Coarse Construction Module (CCM) which exploits a fast Multi-View Stereo model to initialize a 3D Gaussian Splatting (3DGS) model and recover initial camera poses. To refine the 3D model at novel viewpoints, we propose a Confidence Aware Depth Alignment (CADA) module to refine the coarse depth maps by aligning their confident parts with depths estimated by a monocular depth model. Then, a Warped Image-Guided Inpainting (WIGI) module is proposed to warp the training images to novel viewpoints using the refined depth maps, and inpainting is applied to fill the ``holes'' in the warped images caused by view-direction changes, providing high-quality supervision to further optimize the 3D model and the camera poses. Extensive experiments and ablation studies demonstrate the validity of D2T and its design choices, achieving state-of-the-art performance in both novel view synthesis and pose estimation while maintaining high efficiency. Code will be publicly available.
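To make the depth-alignment step concrete, here is a minimal sketch of fitting a least-squares scale and shift on confident pixels only; the confidence rule and closed form are illustrative assumptions, not necessarily CADA's exact formulation.

```python
# Illustrative sketch: align a monocular depth map to a coarse depth map using a
# scale and shift fitted only on confident pixels.
import numpy as np

def align_mono_depth(mono, coarse, confidence, thresh=0.8):
    """mono, coarse: (H, W) depth maps; confidence: (H, W) values in [0, 1]."""
    m = confidence > thresh
    A = np.stack([mono[m], np.ones(m.sum())], axis=1)       # fit coarse ~ s * mono + b
    (s, b), *_ = np.linalg.lstsq(A, coarse[m], rcond=None)
    return s * mono + b                                     # refined dense depth
```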
https://arxiv.org/abs/2412.19518
Portraits or selfie images taken from a close distance typically suffer from perspective distortion. In this paper, we propose an end-to-end deep learning-based rectification pipeline to mitigate the effects of perspective distortion. We learn to predict the facial depth by training a deep CNN. The estimated depth is utilized to adjust the camera-to-subject distance by moving the camera farther, increasing the camera focal length, and reprojecting the 3D image features to the new perspective. The reprojected features are then fed to an inpainting module to fill in the missing pixels. We leverage a differentiable renderer to enable end-to-end training of our depth estimation and feature extraction nets to improve the rectified outputs. To boost the results of the inpainting module, we incorporate an auxiliary module to predict the horizontal movement of the camera, which decreases the area that requires hallucination of challenging face parts such as ears. Unlike previous works, we process the full-frame input image at once, without cropping the subject's face and processing it separately from the rest of the body, eliminating the need for complex post-processing steps to attach the face back to the subject's body. To train our network, we utilize the popular game engine Unreal Engine to generate a large synthetic face dataset containing various subjects, head poses, expressions, eyewear, clothes, and lighting. Quantitative and qualitative results show that our rectification pipeline outperforms previous methods and produces results comparable to a time-consuming 3D GAN-based method while being more than 260 times faster.
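As a rough guide to the geometry involved, the following sketch shows the pinhole dolly-zoom reprojection that this kind of rectification approximates: the camera is pushed back by `delta` while the focal length is scaled so a reference depth keeps its image size. All symbols here are illustrative, not the paper's notation.

```python
# Illustrative sketch of the pinhole dolly-zoom reprojection underlying
# perspective-distortion rectification.
import numpy as np

def reproject_dolly_zoom(u, v, z, f, cx, cy, delta, z_ref):
    """u, v: pixel coordinates; z: per-pixel depth; f: focal length; (cx, cy): principal point."""
    X = (u - cx) * z / f                      # back-project to camera space
    Y = (v - cy) * z / f
    f_new = f * (z_ref + delta) / z_ref       # keep the reference plane at the same scale
    u_new = f_new * X / (z + delta) + cx      # re-project from the farther camera
    v_new = f_new * Y / (z + delta) + cy
    return u_new, v_new
```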
https://arxiv.org/abs/2412.19189