Object removal refers to the process of erasing designated objects from an image while preserving the overall appearance, and it is one area where image inpainting is widely used in real-world applications. The performance of an object remover is quantitatively evaluated by measuring the quality of object removal results, similar to how the performance of an image inpainter is gauged. Current works reporting quantitative performance evaluations utilize original images as references. In this letter, to validate that the current evaluation methods cannot properly evaluate the performance of an object remover, we create a dataset with object removal ground truth and compare the evaluations made by the current methods using original images to those utilizing object removal ground-truth images. The disparities between the two evaluation sets validate that the current methods are not suitable for measuring the performance of an object remover. Additionally, we propose new evaluation methods tailored to gauge the performance of an object remover. The proposed methods evaluate performance through class-wise object removal results and utilize images without the target-class objects as a comparison set. We confirm that the proposed methods make judgments consistent with human evaluators on the COCO dataset and produce measurements that align with those using object removal ground truth on the self-acquired dataset.
https://arxiv.org/abs/2404.11104
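The abstract does not spell out the exact scoring protocol, but the comparison-set idea can be illustrated with a small sketch: score one class's removal results against real images that contain no objects of that class, here using the FID implementation from torchmetrics. The tensor names are placeholders and the paper's actual metric may differ.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def class_wise_removal_fid(removal_results: torch.Tensor,
                           reference_without_class: torch.Tensor) -> float:
    """Hypothetical class-wise removal score: FID between object-removal
    outputs for one target class and real images containing no objects of
    that class. Both tensors are uint8 with shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(reference_without_class, real=True)   # comparison set
    fid.update(removal_results, real=False)          # remover outputs
    return float(fid.compute())
```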
The task of video inpainting detection is to expose the pixel-level inpainted regions within a video sequence. Existing methods usually focus on leveraging spatial and temporal inconsistencies. However, they typically employ fixed operations to combine spatial and temporal clues, limiting their applicability across different scenarios. In this paper, we introduce a novel Multilateral Temporal-view Pyramid Transformer (MumPy) that combines spatial and temporal clues flexibly. Our method utilizes a newly designed multilateral temporal-view encoder to extract various collaborations of spatial-temporal clues and introduces a deformable window-based temporal-view interaction module to enhance the diversity of these collaborations. Subsequently, we develop a multi-pyramid decoder to aggregate the various types of features and generate detection maps. By adjusting the contribution strength of spatial and temporal clues, our method can effectively identify inpainted regions. We validate our method on existing datasets and also introduce a new, challenging, large-scale video inpainting dataset based on the YouTube-VOS dataset, built with several more recent inpainting methods. The results demonstrate the superiority of our method in both in-domain and cross-domain evaluation scenarios.
https://arxiv.org/abs/2404.11054
Neural reconstruction approaches are rapidly emerging as the preferred representation for 3D scenes, but their limited editability still poses a challenge. In this work, we propose an approach for 3D scene inpainting -- the task of coherently replacing parts of the reconstructed scene with desired content. Scene inpainting is an inherently ill-posed task, as there exist many solutions that plausibly replace the missing content. A good inpainting method should therefore enable not only high-quality synthesis but also a high degree of control. Based on this observation, we focus on enabling explicit control over the inpainted content and leverage a reference image as an efficient means to achieve this goal. Specifically, we introduce RefFusion, a novel 3D inpainting method based on a multi-scale personalization of an image inpainting diffusion model to the given reference view. The personalization effectively adapts the prior distribution to the target scene, resulting in a lower variance of the score distillation objective and hence significantly sharper details. Our framework achieves state-of-the-art results for object removal while maintaining high controllability. We further demonstrate the generality of our formulation on other downstream tasks such as object insertion, scene outpainting, and sparse view reconstruction.
https://arxiv.org/abs/2404.10765
Few-shot segmentation is the task of segmenting objects or regions of novel classes within an image given only a few annotated examples. In the generalized setting, the task extends to segmenting both the base and the novel classes. The main challenge is how to train the model such that the addition of novel classes does not hurt the base classes' performance, a problem known as catastrophic forgetting. To mitigate this issue, we use SegGPT as our base model and train it on the base classes. Then, we use separate learnable prompts to handle predictions for each novel class. To handle the wide range of object sizes typical of the remote sensing domain, we perform patch-based prediction. To address the discontinuities along patch boundaries, we propose a patch-and-stitch technique that re-frames the problem as an image inpainting task. During inference, we also utilize image similarity search over image embeddings for prompt selection and novel class filtering to reduce false positive predictions. Based on our experiments, our proposed method boosts the weighted mIoU of a simply fine-tuned SegGPT from 15.96 to 35.08 on the validation set of the few-shot OpenEarthMap dataset provided in the challenge.
https://arxiv.org/abs/2404.10307
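As a minimal, assumption-laden illustration of the patch-based prediction above (the paper's patch-and-stitch additionally re-frames the seams as an inpainting problem, which is omitted here), the plain tiling step could look like this; `predict_fn` stands in for a per-patch segmentation model:

```python
import numpy as np

def patchwise_segment(image: np.ndarray, predict_fn, patch: int = 256) -> np.ndarray:
    """Run a per-patch segmentation callable over non-overlapping tiles of a
    large remote-sensing image and paste the label maps back. Boundary tiles
    may be smaller than `patch`; predict_fn is assumed to handle that."""
    h, w = image.shape[:2]
    labels = np.zeros((h, w), dtype=np.int64)
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tile = image[top:top + patch, left:left + patch]
            labels[top:top + patch, left:left + patch] = predict_fn(tile)
    return labels
```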
Generating background scenes for salient objects plays a crucial role across various domains including creative design and e-commerce, as it enhances the presentation and context of subjects by integrating them into tailored environments. Background generation can be framed as a task of text-conditioned outpainting, where the goal is to extend image content beyond a salient object's boundaries on a blank background. Although popular diffusion models for text-guided inpainting can also be used for outpainting by mask inversion, they are trained to fill in missing parts of an image rather than to place an object into a scene. Consequently, when used for background creation, inpainting models frequently extend the salient object's boundaries and thereby change the object's identity, which is a phenomenon we call "object expansion." This paper introduces a model for adapting inpainting diffusion models to the salient object outpainting task using Stable Diffusion and ControlNet architectures. We present a series of qualitative and quantitative results across models and datasets, including a newly proposed metric to measure object expansion that does not require any human labeling. Compared to Stable Diffusion 2.0 Inpainting, our proposed approach reduces object expansion by 3.6x on average with no degradation in standard visual metrics across multiple datasets.
https://arxiv.org/abs/2404.10157
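The expansion metric itself is not defined in the abstract; a toy sketch of one labeling-free formulation (assuming the salient object can be re-segmented in the generated image with an off-the-shelf segmenter) is:

```python
import numpy as np

def object_expansion(original_mask: np.ndarray, generated_mask: np.ndarray) -> float:
    """Hypothetical expansion score: fraction of object pixels in the
    outpainted image that lie outside the original salient-object mask,
    normalized by the original object area. Inputs are boolean (H, W) masks."""
    original = original_mask.astype(bool)
    generated = generated_mask.astype(bool)
    grown = np.logical_and(generated, ~original)
    return float(grown.sum()) / max(int(original.sum()), 1)
```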
Neural Radiance Field (NeRF) is a representation for 3D reconstruction from multi-view images. Despite some recent work showing preliminary success in editing a reconstructed NeRF with a diffusion prior, such methods still struggle to synthesize reasonable geometry in completely uncovered regions. One major reason is the high diversity of synthetic content from the diffusion model, which hinders the radiance field from converging to a crisp and deterministic geometry. Moreover, applying latent diffusion models to real data often yields a textural shift incoherent with the image condition due to auto-encoding errors. These two problems are further reinforced by the use of pixel-distance losses. To address these issues, we propose tempering the diffusion model's stochasticity with per-scene customization and mitigating the textural shift with masked adversarial training. During our analyses, we also found that the commonly used pixel and perceptual losses are harmful in the NeRF inpainting task. Through rigorous experiments, our framework yields state-of-the-art NeRF inpainting results on various real-world scenes. Project page: this https URL
https://arxiv.org/abs/2404.09995
Accurate completion and denoising of roof height maps are crucial to reconstructing high-quality 3D buildings. Repairing sparse points can enhance low-cost sensor use and reduce UAV flight overlap. RoofDiffusion is a new end-to-end self-supervised diffusion technique for robustly completing, in particular, difficult roof height maps. RoofDiffusion leverages widely available curated footprints and can thus handle up to 99% point sparsity and 80% roof area occlusion (regional incompleteness). A variant, No-FP RoofDiffusion, simultaneously predicts building footprints and heights. Both quantitatively outperform state-of-the-art unguided depth completion and representative inpainting methods for Digital Elevation Models (DEM), on both a roof-specific benchmark and the BuildingNet dataset. Qualitative assessments show the effectiveness of RoofDiffusion on datasets with real-world scans, including AHN3, Dales3D, and USGS 3DEP LiDAR. Tested with the leading City3D algorithm, preprocessing height maps with RoofDiffusion noticeably improves 3D building reconstruction. RoofDiffusion is complemented by a new dataset of 13k complex roof geometries focusing on long-tail issues in remote sensing, a novel simulation of tree occlusion, and a wide variety of large-area roof cut-outs for data augmentation and benchmarking.
https://arxiv.org/abs/2404.09290
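To make the sparsity and occlusion figures above concrete, here is a toy corruption routine of the kind such a self-supervised pipeline might train against; the paper's simulations (e.g. tree occlusion) are more elaborate, and all parameter names here are illustrative:

```python
import numpy as np

def corrupt_height_map(height: np.ndarray, sparsity: float = 0.99,
                       occlusion_frac: float = 0.5,
                       rng: np.random.Generator | None = None) -> np.ndarray:
    """Toy corruption of a roof height map: keep only (1 - sparsity) of the
    points and zero out one rectangular region to mimic regional incompleteness."""
    rng = rng or np.random.default_rng()
    h, w = height.shape
    keep = rng.random((h, w)) > sparsity              # random point sparsity
    corrupted = np.where(keep, height, 0.0)
    oh, ow = int(h * occlusion_frac), int(w * occlusion_frac)
    top = rng.integers(0, h - oh + 1)
    left = rng.integers(0, w - ow + 1)
    corrupted[top:top + oh, left:left + ow] = 0.0     # regional occlusion
    return corrupted
```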
We introduce RealmDreamer, a technique for generation of general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing the state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model by conditioning on the samples from the inpainting model, giving rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
https://arxiv.org/abs/2404.07199
Fractional Brownian motion (fBm) trajectories feature both randomness and strong scale-free correlations, challenging generative models to reproduce the intrinsic memory characterizing the underlying process. Here we test a diffusion probabilistic model on a specific dataset of corrupted images corresponding to incomplete Euclidean distance matrices of fBm at various memory exponents $H$. Our dataset implies uniqueness of the data imputation in the regime of low missing ratio, where the remaining partial graph is rigid, providing the ground truth for the inpainting. We find that the conditional diffusion generation stably reproduces the statistics of missing fBm-distributed distances for different values of the $H$ exponent. Furthermore, while diffusion models have recently been shown to memorize samples from the training database, we show that diffusion-based inpainting behaves qualitatively differently from database search as the database size increases. Finally, we apply our fBm-trained diffusion model with $H=1/3$ to the completion of chromosome distance matrices obtained in single-cell microscopy experiments, showing its superiority over standard bioinformatics algorithms. Our source code is available on GitHub at this https URL.
https://arxiv.org/abs/2404.07029
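For readers unfamiliar with the data format, a small sketch of how such corrupted inputs can be simulated is given below: a 1D fBm path sampled exactly from its covariance, its pairwise distance matrix, and random removal of symmetric entries. This only illustrates the setup, not the authors' generation code.

```python
import numpy as np

def fbm_path(n: int, hurst: float, rng: np.random.Generator) -> np.ndarray:
    """Exact sample of 1D fractional Brownian motion at t = 1..n, using
    Cov(B_t, B_s) = 0.5 * (t^{2H} + s^{2H} - |t - s|^{2H})."""
    t = np.arange(1, n + 1, dtype=float)
    cov = 0.5 * (t[:, None] ** (2 * hurst) + t[None, :] ** (2 * hurst)
                 - np.abs(t[:, None] - t[None, :]) ** (2 * hurst))
    return np.linalg.cholesky(cov) @ rng.standard_normal(n)

def incomplete_distance_matrix(path: np.ndarray, missing_ratio: float,
                               rng: np.random.Generator) -> np.ndarray:
    """Pairwise Euclidean distance matrix of the path with a random fraction
    of (symmetric) entries marked missing, mimicking the corrupted images
    the diffusion model learns to inpaint."""
    d = np.abs(path[:, None] - path[None, :])
    drop = np.triu(rng.random(d.shape) < missing_ratio, k=1)
    drop = drop | drop.T
    d[drop] = np.nan
    return d
```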
Diffusion models have shown remarkable results for image generation, editing, and inpainting. Recent works explore diffusion models for 3D shape generation with neural implicit functions, i.e., signed distance functions and occupancy functions. However, they are limited to shapes with closed surfaces, which prevents them from generating diverse 3D real-world content containing open surfaces. In this work, we present UDiFF, a 3D diffusion model for unsigned distance fields (UDFs) that is capable of generating textured 3D shapes with open surfaces from text conditions or unconditionally. Our key idea is to generate UDFs in the spatial-frequency domain with an optimal wavelet transformation, which produces a compact representation space for UDF generation. Specifically, instead of hand-selecting a wavelet transformation, which requires expensive manual effort and still leads to large information loss, we propose a data-driven approach to learn the optimal wavelet transformation for UDFs. We evaluate UDiFF to show our advantages through numerical and visual comparisons with the latest methods on widely used benchmarks. Page: this https URL.
https://arxiv.org/abs/2404.06851
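As background for the spatial-frequency idea, the sketch below decomposes a UDF sampled on a regular 3D grid with a fixed biorthogonal wavelet from PyWavelets; UDiFF instead learns the wavelet filters in a data-driven way, so this only illustrates the kind of compact coarse volume a diffusion model would operate on.

```python
import numpy as np
import pywt

def udf_wavelet_coarse(udf_grid: np.ndarray, wavelet: str = "bior4.4",
                       level: int = 2) -> np.ndarray:
    """Multi-level n-D wavelet decomposition of a UDF volume; returns the
    low-frequency approximation coefficients, a compact stand-in for the
    representation space discussed above."""
    coeffs = pywt.wavedecn(udf_grid, wavelet=wavelet, level=level)
    return coeffs[0]   # coarse approximation volume
```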
We propose ZeST, a method for zero-shot material transfer to an object in an input image given a material exemplar image. ZeST leverages existing diffusion adapters to extract an implicit material representation from the exemplar image. This representation is used to transfer the material onto the object in the input image with a pre-trained inpainting diffusion model, using depth estimates as a geometry cue and grayscale object shading as an illumination cue. The method works on real images without any training, resulting in a zero-shot approach. Both qualitative and quantitative results on real and synthetic datasets demonstrate that ZeST outputs photorealistic images with transferred materials. We also show the application of ZeST to perform multiple edits and robust material assignment under different illuminations. Project Page: this https URL
https://arxiv.org/abs/2404.06425
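One reading of the "grayscale object shading as an illumination cue" step is sketched below; this is a simplification, and the paper may extract and inject the cue differently:

```python
import numpy as np

def illumination_cue(image: np.ndarray, object_mask: np.ndarray) -> np.ndarray:
    """Grayscale shading of the masked object, usable as an illumination cue.
    `image` is float RGB in [0, 1] with shape (H, W, 3); `object_mask` is a
    boolean (H, W) array selecting the target object."""
    gray = image @ np.array([0.299, 0.587, 0.114])   # luma approximation
    cue = np.zeros_like(gray)
    cue[object_mask] = gray[object_mask]
    return cue
```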
In Generalized Category Discovery (GCD), we cluster unlabeled samples of known and novel classes, leveraging a training dataset of known classes. A salient challenge arises due to domain shifts between these datasets. To address this, we present a novel setting, Across Domain Generalized Category Discovery (AD-GCD), and bring forth CDAD-NET (Class Discoverer Across Domains) as a remedy. CDAD-NET is architected to synchronize potential known-class samples across both the labeled (source) and unlabeled (target) datasets, while emphasizing the distinct categorization of the target data. To facilitate this, we propose an entropy-driven adversarial learning strategy that accounts for the distance distributions of target samples relative to source-domain class prototypes. In parallel, the discriminative nature of the shared space is upheld through a fusion of three metric learning objectives. In the source domain, our focus is on refining the proximity between samples and their affiliated class prototypes, while in the target domain, we integrate a neighborhood-centric contrastive learning mechanism enriched with an adept neighbor-mining approach. To further accentuate the nuanced feature interrelations among semantically aligned images, we champion the concept of conditional image inpainting, underscoring the premise that semantically analogous images prove more efficacious to the task than their disjointed counterparts. Experimentally, CDAD-NET eclipses the existing literature with a performance increment of 8-15% on the three AD-GCD benchmarks we present.
https://arxiv.org/abs/2404.05366
Recent advancements in diffusion-based generative image editing have sparked a profound revolution, reshaping the landscape of image outpainting and inpainting tasks. Despite these strides, the field grapples with inherent challenges, including: i) inferior quality; ii) poor consistency; iii) insufficient instruction adherence; iv) suboptimal generation efficiency. To address these obstacles, we present ByteEdit, an innovative feedback learning framework meticulously designed to Boost, Comply, and Accelerate generative image editing tasks. ByteEdit seamlessly integrates image reward models dedicated to enhancing aesthetics and image-text alignment, while also introducing a dense, pixel-level reward model tailored to foster coherence in the output. Furthermore, we propose a pioneering adversarial and progressive feedback learning strategy to expedite the model's inference speed. Through extensive large-scale user evaluations, we demonstrate that ByteEdit surpasses leading generative image editing products, including Adobe, Canva, and MeiTu, in both generation quality and consistency. ByteEdit-Outpainting exhibits a remarkable enhancement of 388% and 135% in quality and consistency, respectively, when compared to the baseline model. Experiments also verified that our accelerated models maintain excellent performance in terms of quality and consistency.
https://arxiv.org/abs/2404.04860
Recent advancements in diffusion models have shown remarkable proficiency in editing 2D images based on text prompts. However, extending these techniques to edit scenes in Neural Radiance Fields (NeRF) is complex, as editing individual 2D frames can result in inconsistencies across multiple views. Our crucial insight is that a NeRF scene's geometry can serve as a bridge to integrate these 2D edits. Utilizing this geometry, we employ a depth-conditioned ControlNet to enhance the coherence of each 2D image modification. Moreover, we introduce an inpainting approach that leverages the depth information of NeRF scenes to distribute 2D edits across different images, ensuring robustness against errors and resampling challenges. Our results reveal that this methodology achieves more consistent, lifelike, and detailed edits than existing leading methods for text-driven NeRF scene editing.
https://arxiv.org/abs/2404.04526
The rapid development of 3D acquisition technology has made it possible to obtain point clouds of real-world terrains. However, due to limitations in sensor acquisition technology or specific requirements, point clouds often contain defects such as holes with missing data. Inpainting algorithms are widely used to patch these holes. However, existing traditional inpainting algorithms rely on precise hole boundaries, which limits their ability to handle cases where the boundaries are not well-defined. On the other hand, learning-based completion methods often prioritize reconstructing the entire point cloud instead of solely focusing on hole filling. Based on the fact that real-world terrain exhibits both global smoothness and rich local detail, we propose a novel representation for terrain point clouds. This representation can help to repair the holes without clear boundaries. Specifically, it decomposes terrains into low-frequency and high-frequency components, which are represented by B-spline surfaces and relative height maps respectively. In this way, the terrain point cloud inpainting problem is transformed into a B-spline surface fitting and 2D image inpainting problem. By solving the two problems, the highly complex and irregular holes on the terrain point clouds can be well-filled, which not only satisfies the global terrain undulation but also exhibits rich geometric details. The experimental results also demonstrate the effectiveness of our method.
https://arxiv.org/abs/2404.03572
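The low-/high-frequency split described above can be illustrated with a smoothing B-spline fit on a gridded height map using SciPy's RectBivariateSpline; the smoothing weight and regular-grid assumption here are illustrative rather than the paper's actual parameterization:

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

def decompose_height_map(height: np.ndarray, smoothing: float = 1e3):
    """Split a gridded terrain height map into a smooth low-frequency B-spline
    surface (global undulation) and a high-frequency residual (local detail)."""
    rows = np.arange(height.shape[0], dtype=float)
    cols = np.arange(height.shape[1], dtype=float)
    spline = RectBivariateSpline(rows, cols, height, kx=3, ky=3, s=smoothing)
    low_freq = spline(rows, cols)      # smooth B-spline surface
    high_freq = height - low_freq      # relative height map
    return low_freq, high_freq
```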
We present GenN2N, a unified NeRF-to-NeRF translation framework for various NeRF translation tasks such as text-driven NeRF editing, colorization, super-resolution, inpainting, etc. Unlike previous methods designed for individual translation tasks with task-specific schemes, GenN2N achieves all these NeRF editing tasks by employing a plug-and-play image-to-image translator to perform editing in the 2D domain and lifting the 2D edits into the 3D NeRF space. Since the 3D consistency of 2D edits may not be assured, we propose to model the distribution of the underlying 3D edits through a generative model that can cover all possible edited NeRFs. To model the distribution of 3D edited NeRFs from 2D edited images, we carefully design a VAE-GAN that encodes images while decoding NeRFs. The latent space is trained to align with a Gaussian distribution, and the NeRFs are supervised through an adversarial loss on their renderings. To ensure that the latent code does not depend on 2D viewpoints but truly reflects the 3D edits, we also regularize the latent code through a contrastive learning scheme. Extensive experiments on various editing tasks show that GenN2N, as a universal framework, performs as well as or better than task-specific specialists while possessing flexible generative power. More results on our project page: this https URL
https://arxiv.org/abs/2404.02788
Our paper addresses the complex task of transferring a hairstyle from a reference image to an input photo for virtual hair try-on. This task is challenging due to the need to adapt to various photo poses, the sensitivity of hairstyles, and the lack of objective metrics. Current state-of-the-art hairstyle transfer methods use an optimization process for different parts of the approach, making them prohibitively slow. At the same time, faster encoder-based models are of very low quality because they either operate in StyleGAN's W+ space or use other low-dimensional image generators. Additionally, both approaches struggle with hairstyle transfer when the source pose is very different from the target pose, because they either don't consider the pose at all or deal with it inefficiently. In our paper, we present the HairFast model, which uniquely solves these problems and achieves high resolution, near real-time performance, and superior reconstruction compared to optimization-based methods. Our solution includes a new architecture operating in the FS latent space of StyleGAN, an enhanced inpainting approach, improved encoders for better alignment and color transfer, and a new encoder for post-processing. The effectiveness of our approach is demonstrated on realism metrics after random hairstyle transfer and after reconstruction when the original hairstyle is transferred. In the most difficult scenario of transferring both the shape and color of a hairstyle from different images, our method runs in less than a second on an Nvidia V100. Our code is available at this https URL.
https://arxiv.org/abs/2404.01094
Image-based virtual try-on is an increasingly important task for online shopping. It aims to synthesize images of a specific person wearing a specified garment. Diffusion model-based approaches have recently become popular, as they excel at image synthesis tasks. However, these approaches usually employ additional image encoders and rely on the cross-attention mechanism for texture transfer from the garment to the person image, which affects the try-on's efficiency and fidelity. To address these issues, we propose a Texture-Preserving Diffusion (TPD) model for virtual try-on, which enhances the fidelity of the results and introduces no additional image encoders. Accordingly, we make contributions from two aspects. First, we propose to concatenate the masked person and reference garment images along the spatial dimension and utilize the resulting image as the input to the diffusion model's denoising UNet. This enables the original self-attention layers contained in the diffusion model to achieve efficient and accurate texture transfer. Second, we propose a novel diffusion-based method that predicts a precise inpainting mask based on the person and reference garment images, further enhancing the reliability of the try-on results. In addition, we integrate mask prediction and image synthesis into a single compact model. The experimental results show that our approach can be applied to various try-on tasks, e.g., garment-to-person and person-to-person try-ons, and significantly outperforms state-of-the-art methods on the popular VITON and VITON-HD databases.
https://arxiv.org/abs/2404.01089
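The first contribution above (spatial concatenation so the UNet's own self-attention sees both images) is simple to state in code; stacking along the height axis is an assumption here, as the abstract does not say which spatial axis is used:

```python
import torch

def build_tryon_input(masked_person: torch.Tensor, garment: torch.Tensor) -> torch.Tensor:
    """Concatenate the masked person image and the reference garment image
    along the height axis so the denoising UNet's self-attention layers can
    transfer texture between the two halves. Shapes: (B, C, H, W), same width."""
    return torch.cat([masked_person, garment], dim=2)
```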
Omnidirectional cameras are extensively used in various applications to provide a wide field of vision. However, they face a challenge in synthesizing novel views due to the inevitable presence of dynamic objects, including the photographer, in their wide field of view. In this paper, we introduce a new approach called Omnidirectional Local Radiance Fields (OmniLocalRF) that can render static-only scene views, removing and inpainting dynamic objects simultaneously. Our approach combines the principles of local radiance fields with the bidirectional optimization of omnidirectional rays. Our input is an omnidirectional video, and we evaluate the mutual observations of the entire angle between the previous and current frames. To reduce ghosting artifacts of dynamic objects and inpaint occlusions, we devise a multi-resolution motion mask prediction module. Unlike existing methods that primarily separate dynamic components through the temporal domain, our method uses multi-resolution neural feature planes for precise segmentation, which is more suitable for long 360-degree videos. Our experiments validate that OmniLocalRF outperforms existing methods in both qualitative and quantitative metrics, especially in scenarios with complex real-world scenes. In particular, our approach eliminates the need for manual interaction, such as drawing motion masks by hand and additional pose estimation, making it a highly effective and efficient solution.
https://arxiv.org/abs/2404.00676
Transformer-based methods have recently achieved great success in image inpainting. However, we find that these solutions regard each pixel as a token, thus suffering from an information loss issue from two aspects: 1) they downsample the input image into much lower resolutions for efficiency considerations; 2) they quantize $256^3$ RGB values to a small number (such as 512) of quantized color values. The indices of the quantized pixels are used as tokens for the inputs and prediction targets of the transformer. To mitigate these issues, we propose a new transformer-based framework called "PUT". Specifically, to avoid input downsampling while maintaining computational efficiency, we design a patch-based auto-encoder, P-VQVAE. The encoder converts the masked image into non-overlapping patch tokens, and the decoder recovers the masked regions from the inpainted tokens while keeping the unmasked regions unchanged. To eliminate the information loss caused by input quantization, an Un-quantized Transformer is applied. It directly takes features from the P-VQVAE encoder as input without any quantization and only regards the quantized tokens as prediction targets. Furthermore, to make the inpainting process more controllable, we introduce semantic and structural conditions as extra guidance. Extensive experiments show that our method greatly outperforms existing transformer-based methods in image fidelity and achieves much higher diversity and better fidelity than state-of-the-art pluralistic inpainting methods on complex large-scale datasets (e.g., ImageNet). Code is available at this https URL.
https://arxiv.org/abs/2404.00513
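A toy version of the patch-token idea above (non-overlapping patches taken at full input resolution) is shown below; the real P-VQVAE encoder uses learned convolutions and a codebook, so this only illustrates why no downsampling is needed:

```python
import torch

def to_patch_tokens(image: torch.Tensor, patch_size: int = 8) -> torch.Tensor:
    """Split an image of shape (C, H, W) into non-overlapping patch tokens of
    shape (num_patches, C * patch_size**2) without any downsampling."""
    c, h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # (C, H/ps, W/ps, ps, ps) -> (H/ps * W/ps, C * ps * ps)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)
```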