Video Wire Inpainting (VWI) is a prominent application of video inpainting, aimed at flawlessly removing wires in films or TV series, offering significant time and labor savings compared to manual frame-by-frame removal. However, wire removal poses greater challenges: wires are longer and slimmer than the objects typically targeted in general video inpainting tasks, and they often intersect irregularly with people and background objects, which adds complexity to the inpainting process. Recognizing the limitations of existing video wire datasets, which are small, of poor quality, and limited in scene variety, we introduce a new VWI dataset with a novel mask generation strategy, namely the Wire Removal Video Dataset 2 (WRV2) and Pseudo Wire-Shaped (PWS) masks. The WRV2 dataset comprises over 4,000 videos with an average length of 80 frames, designed to facilitate the development and improve the efficacy of inpainting models. Building upon this, our research proposes the Redundancy-Aware Transformer (Raformer) method, which addresses the unique challenges of wire removal in video inpainting. Unlike conventional approaches that indiscriminately process all frame patches, Raformer employs a novel strategy to selectively bypass redundant parts, such as static background segments devoid of valuable information for inpainting. At the core of Raformer is the Redundancy-Aware Attention (RAA) module, which isolates and accentuates essential content through a coarse-grained, window-based attention mechanism. This is complemented by a Soft Feature Alignment (SFA) module, which refines these features and achieves end-to-end feature alignment. Extensive experiments on both traditional video inpainting datasets and our proposed WRV2 dataset demonstrate that Raformer outperforms other state-of-the-art methods.
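The redundancy-bypassing idea behind RAA can be pictured with a minimal sketch: score coarse windows by how much their content changes across frames, and run attention only on the high-variance ones while static windows are skipped. The window size, variance score, and keep ratio below are illustrative assumptions, not the paper's actual design.

```python
import torch

def select_essential_windows(feats, window=8, keep_ratio=0.5):
    """feats: (T, C, H, W) frame features. Returns indices of the windows
    with the highest temporal variance plus pooled window tokens."""
    T, C, H, W = feats.shape
    # Partition each frame into non-overlapping windows -> (T, N, C, w, w)
    wins = feats.unfold(2, window, window).unfold(3, window, window)
    wins = wins.permute(0, 2, 3, 1, 4, 5).reshape(T, -1, C, window, window)
    # Redundancy score: variance of each window's content over time.
    score = wins.var(dim=0).mean(dim=(1, 2, 3))        # (N,)
    k = max(1, int(keep_ratio * score.numel()))
    keep = score.topk(k).indices                       # "essential" windows
    tokens = wins.mean(dim=(3, 4))                     # (T, N, C) coarse tokens
    return keep, tokens

feats = torch.randn(5, 64, 32, 32)                     # 5 frames of features
keep, tokens = select_essential_windows(feats)
essential = tokens[:, keep]                            # attention runs on these only
attn = torch.softmax(essential @ essential.transpose(1, 2) / 8.0, dim=-1)
out = attn @ essential                                 # static windows were bypassed
print(keep.shape, out.shape)
```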
https://arxiv.org/abs/2404.15802
The scarcity of green spaces in urban environments constitutes a critical challenge, with multiple adverse effects on the health and well-being of citizens. Small-scale interventions, e.g. pocket parks, are a viable solution, but they come with multiple constraints involving the design and implementation over a specific area. In this study, we harness the capabilities of generative AI for multi-scale intervention planning, focusing on nature-based solutions (NBS). By leveraging image-to-image and image inpainting algorithms, we propose a methodology to address the green space deficit in urban areas. Focusing on two alleys in Thessaloniki where greenery is lacking, we demonstrate the efficacy of our approach in visualizing NBS interventions. Our findings underscore the transformative potential of emerging technologies in shaping the future of urban intervention planning processes.
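As a rough illustration of the kind of pipeline the study describes, a pretrained inpainting diffusion model can repaint a masked paved area with vegetation. The checkpoint, prompt, and file names below are illustrative assumptions, not the authors' exact setup.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Hypothetical setup: any public text-guided inpainting checkpoint works here.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

street = Image.open("alley.jpg").convert("RGB").resize((512, 512))
mask = Image.open("paved_area_mask.png").convert("L").resize((512, 512))  # white = repaint

result = pipe(
    prompt="a narrow urban alley with dense green planters, climbing ivy, small trees",
    image=street,
    mask_image=mask,
    num_inference_steps=50,
).images[0]
result.save("alley_nbs_intervention.png")   # visualized NBS intervention
```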
https://arxiv.org/abs/2404.15492
Learning-based image stitching techniques typically involve three distinct stages: registration, fusion, and rectangling. These stages are often performed sequentially, each trained independently, leading to potential cascading error propagation and complex parameter tuning challenges. In rethinking the mathematical modeling of the fusion and rectangling stages, we discovered that these processes can be effectively combined into a single, variety-intensity inpainting problem. Therefore, we propose the Simple and Robust Stitcher (SRStitcher), an efficient training-free image stitching method that merges the fusion and rectangling stages into a unified model. By employing the weighted mask and large-scale generative model, SRStitcher can solve the fusion and rectangling problems in a single inference, without additional training or fine-tuning of other models. Our method not only simplifies the stitching pipeline but also enhances fault tolerance towards misregistration errors. Extensive experiments demonstrate that SRStitcher outperforms state-of-the-art (SOTA) methods in both quantitative assessments and qualitative evaluations. The code is released at this https URL
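One way to picture the merged fusion-and-rectangling formulation is a single weighted mask over the registered canvas: fully generate the empty boundary regions (rectangling) and softly re-synthesize a band around the overlap seam (fusion), so one inpainting inference covers both. The weight values and the seam heuristic below are assumptions for illustration, not SRStitcher's actual mask construction.

```python
import numpy as np

def unified_inpaint_mask(valid_a, valid_b):
    """valid_a/valid_b: boolean (H, W) validity maps of the two warped
    images on the common canvas. Returns a float mask in [0, 1]: 1.0
    where the canvas is empty (rectangling), a softer weight along the
    overlap seam (fusion), and 0 where one image alone suffices."""
    overlap = valid_a & valid_b
    empty = ~(valid_a | valid_b)                       # outside both images
    border = np.zeros_like(overlap)
    for v in (valid_a, valid_b):                       # 1-px edges of each validity map
        border |= (v ^ np.roll(v, 1, axis=0)) | (v ^ np.roll(v, 1, axis=1))
    seam = overlap & border
    mask = np.zeros(valid_a.shape, dtype=np.float32)
    mask[seam] = 0.5                                   # softly re-synthesize the seam
    mask[empty] = 1.0                                  # fully generate missing corners
    return mask

a = np.zeros((64, 64), bool); a[:, :40] = True         # warped image A coverage
b = np.zeros((64, 64), bool); b[8:, 24:] = True        # warped image B coverage
m = unified_inpaint_mask(a, b)                         # feed to one inpainting pass
print(m.max(), (m == 1.0).sum(), (m == 0.5).sum())
```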
https://arxiv.org/abs/2404.14951
Hyperspectral imaging (HSI) is a key technology for earth observation, surveillance, medical imaging and diagnostics, astronomy and space exploration. The conventional technology for HSI in remote sensing applications is based on the push-broom scanning approach, in which the camera records the spectral image of one stripe of the scene at a time, and the full image is generated by aggregating these measurements over time. In real-world airborne and spaceborne HSI instruments, empty stripes can appear at certain locations, because platforms do not always maintain a constant programmed attitude or have access to accurate digital elevation maps (DEM), and the travelling track is not necessarily aligned with the hyperspectral cameras at all times. This makes the enhancement of acquired HS images from incomplete or corrupted observations an essential task. Here we introduce a novel HSI inpainting algorithm, called Hyperspectral Equivariant Imaging (Hyper-EI). Hyper-EI is a self-supervised learning-based method which requires neither training on extensive datasets nor access to a pre-trained model. Experimental results show that the proposed method achieves state-of-the-art inpainting performance compared to existing methods.
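The equivariant-imaging principle that Hyper-EI builds on can be sketched compactly for inpainting: with a stripe-masking forward operator A, a network is trained with no ground truth using a measurement-consistency term plus an equivariance term under random shifts. The tiny CNN, band count, and shift group below are illustrative stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 16, 3, padding=1))        # 16 spectral bands

def A(x, mask):                                             # masking forward operator
    return x * mask

def ei_loss(y, mask):
    x1 = net(y)                                             # reconstruction from measurement
    mc = ((A(x1, mask) - y) ** 2).mean()                    # measurement consistency
    shift = int(torch.randint(1, 8, (1,)))
    x2 = torch.roll(x1, shifts=shift, dims=-1)              # group action T(x1)
    x3 = net(A(x2, mask))                                   # re-reconstruct from its measurement
    ei = ((x3 - x2) ** 2).mean()                            # equivariance: f(A T x) ~= T x
    return mc + ei

x_true = torch.rand(1, 16, 64, 64)                          # unknown scene (demo only)
mask = torch.ones(1, 1, 64, 64); mask[..., :, 20:24] = 0    # empty stripes
y = A(x_true, mask)                                         # the only data we observe
loss = ei_loss(y, mask)
loss.backward()
print(float(loss))
```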
https://arxiv.org/abs/2404.13159
Deep learning is dramatically transforming the field of medical imaging and radiology, enabling the identification of pathologies in medical images, including computed tomography (CT) and X-ray scans. However, the performance of deep learning models, particularly in segmentation tasks, is often limited by the need for extensive annotated datasets. To address this challenge, the capabilities of weakly supervised semantic segmentation are explored through the lens of Explainable AI and the generation of counterfactual explanations. The scope of this research is the development of a novel counterfactual inpainting approach (COIN) that flips the predicted classification label from abnormal to normal by using a generative model. For instance, if the classifier deems an input medical image X abnormal, indicating the presence of a pathology, the generative model aims to inpaint the abnormal region, thus reversing the classifier's original prediction. The approach enables us to produce precise segmentations for pathologies without depending on pre-existing segmentation masks. Crucially, it relies only on image-level labels, which are substantially easier to acquire than detailed segmentation masks. The effectiveness of the method is demonstrated by segmenting synthetic targets and actual kidney tumors from CT images acquired from Tartu University Hospital in Estonia. The findings indicate that COIN greatly surpasses established attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an alternative counterfactual explanation method introduced by Singla et al. This evidence suggests that COIN is a promising approach for the semantic segmentation of tumors in CT images, and presents a step forward in making deep learning applications more accessible and effective in healthcare, where annotated data is scarce.
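Schematically, the counterfactual objective pairs a frozen classifier with an inpainting generator: the generator repaints a candidate region so the classifier's prediction flips from abnormal to normal, and the segmentation is read off the difference map. The tiny networks, the minimal-change term, and the threshold below are placeholder assumptions rather than the paper's exact losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

classifier = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
generator = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 1, 3, padding=1))
for p in classifier.parameters():
    p.requires_grad_(False)                                  # classifier stays frozen

def coin_step(x, mask):
    """x: abnormal image (B,1,H,W); mask: candidate region to repaint."""
    inp = torch.cat([x * (1 - mask), mask], dim=1)
    x_cf = x * (1 - mask) + generator(inp) * mask            # counterfactual image
    flip = F.cross_entropy(classifier(x_cf),                 # drive prediction to "normal" (class 0)
                           torch.zeros(x.size(0), dtype=torch.long))
    minimal = F.l1_loss(x_cf * mask, x * mask)               # change as little as possible
    seg = (x - x_cf).abs() > 0.1                             # segmentation from difference map
    return flip + 0.1 * minimal, seg

x = torch.rand(2, 1, 64, 64)
mask = torch.zeros_like(x); mask[..., 20:40, 20:40] = 1
loss, seg = coin_step(x, mask)
loss.backward()                                              # updates the generator only
print(float(loss), seg.float().mean().item())
```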
https://arxiv.org/abs/2404.12832
We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.
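The "lazy" decoding step can be made concrete with a toy example: only the tokens under the user mask pass through the decoder, prepended with the encoder's compact global context, and the generated patches are scattered back into the canvas. The dimensions and the plain TransformerEncoder standing in for the diffusion decoder body are simplifying assumptions.

```python
import torch
import torch.nn as nn

d = 64
context_tokens = torch.randn(1, 8, d)                 # compact global context (from encoder)
canvas = torch.randn(1, 256, d)                       # 16x16 patch tokens of the canvas
mask = torch.zeros(1, 256, dtype=torch.bool)
mask[0, 100:126] = True                               # ~10% of patches are edited

decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2)

masked = canvas[mask].unsqueeze(0)                    # (1, M, d): only masked tokens
x = torch.cat([context_tokens, masked], dim=1)        # prepend global context
y = decoder(x)[:, context_tokens.size(1):]            # decode, drop the context slots
out = canvas.clone()
out[mask] = y[0]                                      # scatter generated patches back
print(masked.shape, out.shape)                        # decoder cost scales with M, not 256
```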
https://arxiv.org/abs/2404.12382
In this work, we study the task of sketch-guided image inpainting. Unlike the well-explored natural-language-guided image inpainting, which excels at capturing semantic details, the relatively less-studied sketch-guided inpainting offers greater user control in specifying the shape and pose of the object to be inpainted. As one of the early solutions to this task, we introduce a novel partial discrete diffusion process (PDDP). The forward pass of the PDDP corrupts the masked regions of the image, and the backward pass reconstructs these masked regions conditioned on hand-drawn sketches using our proposed sketch-guided bi-directional transformer. The proposed transformer module accepts two inputs -- the image containing the masked region to be inpainted and the query sketch -- to model the reverse diffusion process. This strategy effectively addresses the domain gap between sketches and natural images, thereby enhancing the quality of inpainting results. In the absence of a large-scale dataset specific to this task, we synthesize a dataset from MS-COCO to train and extensively evaluate our proposed framework against various competent approaches in the literature. The qualitative and quantitative results and user studies establish that the proposed method inpaints realistic objects that fit the context and match the visual appearance of the provided sketch. To aid further research, we have made our code publicly available at this https URL.
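A minimal version of the *partial* forward corruption reads as follows: discrete image tokens inside the inpainting mask are absorbed into a [MASK] token with probability growing in t, while tokens outside the mask are never touched. The codebook size, schedule, and absorbing-state corruption are illustrative assumptions, not the paper's exact parameterization.

```python
import torch

V, MASK_ID = 1024, 1024                               # codebook size; extra absorbing token

def partial_forward(tokens, region, t, T=100):
    """tokens: (B, N) discrete image tokens; region: (B, N) bool inpainting
    mask; t: timestep. Corrupt each in-region token to [MASK] w.p. t/T."""
    p = t / T
    corrupt = (torch.rand_like(tokens, dtype=torch.float) < p) & region
    out = tokens.clone()
    out[corrupt] = MASK_ID
    return out

tokens = torch.randint(0, V, (1, 256))
region = torch.zeros(1, 256, dtype=torch.bool); region[0, 64:128] = True
x_t = partial_forward(tokens, region, t=50)
changed = (x_t != tokens)
print(changed[~region].any().item(), changed[region].float().mean().item())
# False outside the mask; roughly half of the in-region tokens corrupted.
```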
https://arxiv.org/abs/2404.11949
3D Gaussians have recently emerged as an efficient representation for novel view synthesis. This work studies their editability with a particular focus on the inpainting task, which aims to supplement an incomplete set of 3D Gaussians with additional points for visually harmonious rendering. Compared to 2D inpainting, the crux of inpainting 3D Gaussians is to figure out the rendering-relevant properties of the introduced points, whose optimization largely benefits from their initial 3D positions. To this end, we propose to guide the point initialization with an image-conditioned depth completion model, which learns to directly restore the depth map based on the observed image. Such a design allows our model to fill in depth values at a scale aligned with the original depth, and also to harness the strong generalizability of a large-scale diffusion prior. Thanks to the more accurate depth completion, our approach, dubbed InFusion, surpasses existing alternatives with sufficiently better fidelity and efficiency under various complex scenarios. We further demonstrate the effectiveness of InFusion with several practical applications, such as inpainting with user-specific texture or with novel object insertion.
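The initialization step this design implies is easy to sketch: after the image-conditioned model completes the depth inside the inpainting mask, those pixels are unprojected with the camera intrinsics to seed the positions of the new Gaussians. The pinhole model and intrinsics below are illustrative.

```python
import numpy as np

def unproject_masked_depth(depth, mask, K, cam_to_world):
    """depth: (H, W) completed depth; mask: (H, W) bool region to fill;
    K: 3x3 intrinsics; cam_to_world: 4x4 pose. Returns (M, 3) points."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # homogeneous camera coords
    return (pts_cam @ cam_to_world.T)[:, :3]

H = W = 64
depth = np.full((H, W), 2.0)                                  # depth completed by the model
mask = np.zeros((H, W), bool); mask[20:40, 20:40] = True      # inpainted region
K = np.array([[60.0, 0, W / 2], [0, 60.0, H / 2], [0, 0, 1]])
pose = np.eye(4)
init_xyz = unproject_masked_depth(depth, mask, K, pose)       # seeds the new Gaussians
print(init_xyz.shape)                                         # (400, 3)
```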
https://arxiv.org/abs/2404.11613
Object removal refers to the process of erasing designated objects from an image while preserving the overall appearance, and it is one area where image inpainting is widely used in real-world applications. The performance of an object remover is quantitatively evaluated by measuring the quality of object removal results, similar to how the performance of an image inpainter is gauged. Current works reporting quantitative performance evaluations utilize original images as references. In this letter, to validate that the current evaluation methods cannot properly evaluate the performance of an object remover, we create a dataset with object removal ground truth and compare the evaluations made by the current methods using original images to those utilizing object removal ground truth images. The disparities between the two evaluation sets validate that the current methods are not suitable for measuring the performance of an object remover. Additionally, we propose new evaluation methods tailored to gauge the performance of an object remover. The proposed methods evaluate performance through class-wise object removal results and utilize images without the target class objects as a comparison set. We confirm that the proposed methods can make judgments consistent with human evaluators on the COCO dataset, and that they can produce measurements aligning with those using object removal ground truth on the self-acquired dataset.
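One plausible instantiation of the comparison-set idea (hedged, since the letter defines its own metrics): measure the Frechet distance between deep features of class-wise removal results and features of images that genuinely lack the target class, instead of referencing the original images. The feature extractor is left abstract here.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat_a, feat_b):
    """feat_a/feat_b: (N, D) deep features of two image sets."""
    mu_a, mu_b = feat_a.mean(0), feat_b.mean(0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):                 # numerical artifact of sqrtm
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2 * covmean))

# Hypothetical features: removal results vs. images lacking the target class.
removed = np.random.randn(200, 64)       # features of class-wise removal outputs
reference = np.random.randn(200, 64)     # features of "no target class" images
print(frechet_distance(removed, reference))
```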
https://arxiv.org/abs/2404.11104
The task of video inpainting detection is to expose the pixel-level inpainted regions within a video sequence. Existing methods usually focus on leveraging spatial and temporal inconsistencies. However, these methods typically employ fixed operations to combine spatial and temporal clues, limiting their applicability in different scenarios. In this paper, we introduce a novel Multilateral Temporal-view Pyramid Transformer (MumPy) that combines spatial-temporal clues flexibly. Our method utilizes a newly designed multilateral temporal-view encoder to extract various collaborations of spatial-temporal clues and introduces a deformable window-based temporal-view interaction module to enhance the diversity of these collaborations. Subsequently, we develop a multi-pyramid decoder to aggregate the various types of features and generate detection maps. By adjusting the contribution strength of spatial and temporal clues, our method can effectively identify inpainted regions. We validate our method on existing datasets and also introduce a new challenging and large-scale video inpainting detection dataset based on the YouTube-VOS dataset, which employs several more recent inpainting methods. The results demonstrate the superiority of our method in both in-domain and cross-domain evaluation scenarios.
https://arxiv.org/abs/2404.11054
Neural reconstruction approaches are rapidly emerging as the preferred representation for 3D scenes, but their limited editability still poses a challenge. In this work, we propose an approach for 3D scene inpainting -- the task of coherently replacing parts of the reconstructed scene with desired content. Scene inpainting is an inherently ill-posed task as there exist many solutions that plausibly replace the missing content. A good inpainting method should therefore not only enable high-quality synthesis but also a high degree of control. Based on this observation, we focus on enabling explicit control over the inpainted content and leverage a reference image as an efficient means to achieve this goal. Specifically, we introduce RefFusion, a novel 3D inpainting method based on a multi-scale personalization of an image inpainting diffusion model to the given reference view. The personalization effectively adapts the prior distribution to the target scene, resulting in a lower variance of the score distillation objective and hence significantly sharper details. Our framework achieves state-of-the-art results for object removal while maintaining high controllability. We further demonstrate the generality of our formulation on other downstream tasks such as object insertion, scene outpainting, and sparse view reconstruction.
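For context, the score-distillation objective that RefFusion personalizes (and whose variance it reduces) takes the standard SDS form below, shown with a stand-in UNet and toy schedule; this is the generic recipe, not the authors' implementation.

```python
import torch
import torch.nn as nn

unet = nn.Conv2d(4, 4, 3, padding=1)                    # stand-in noise predictor
latents = torch.randn(1, 4, 32, 32, requires_grad=True) # rendered view, encoded

def sds_step(z, t_max=1000):
    t = torch.randint(50, t_max, (1,))
    alpha = 1.0 - t.float() / t_max                     # toy noise schedule
    eps = torch.randn_like(z)
    z_t = alpha.sqrt() * z + (1 - alpha).sqrt() * eps   # diffuse the latent
    with torch.no_grad():
        eps_hat = unet(z_t)                             # frozen diffusion prior
    grad = eps_hat - eps                                # SDS gradient direction
    return (grad * z).sum()                             # so d(loss)/dz == grad

loss = sds_step(latents)
loss.backward()                                         # updates the scene, not the prior
print(latents.grad.abs().mean().item())
```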
https://arxiv.org/abs/2404.10765
Few-shot segmentation is the task of segmenting objects or regions of novel classes within an image given only a few annotated examples. In the generalized setting, the task extends to segmenting both the base and the novel classes. The main challenge is how to train the model such that the addition of novel classes does not hurt performance on the base classes, a failure mode known as catastrophic forgetting. To mitigate this issue, we use SegGPT as our base model and train it on the base classes. Then, we use separate learnable prompts to handle predictions for each novel class. To handle the wide range of object sizes typical of the remote sensing domain, we perform patch-based prediction. To address the discontinuities along patch boundaries, we propose a patch-and-stitch technique that re-frames the problem as an image inpainting task. During inference, we also utilize image similarity search over image embeddings for prompt selection, and novel class filtering to reduce false positive predictions. In our experiments, the proposed method boosts the weighted mIoU of a simply fine-tuned SegGPT from 15.96 to 35.08 on the validation set of the few-shot OpenEarthMap dataset given in the challenge.
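The patch-and-stitch step can be sketched as follows under stated assumptions: predict each patch independently, then mask out the bands around patch boundaries, where discontinuities arise, and re-predict them as an inpainting problem over the stitched map. `predict_patch` is a hypothetical stand-in for the prompted SegGPT call.

```python
import numpy as np

def predict_patch(patch):                       # hypothetical segmentation call
    return (patch.mean(axis=-1) > 0.5).astype(np.float32)

def patch_and_stitch(image, patch=64, band=4):
    H, W, _ = image.shape
    seg = np.zeros((H, W), np.float32)
    hole = np.zeros((H, W), bool)
    for i in range(0, H, patch):                # per-patch prediction
        for j in range(0, W, patch):
            seg[i:i+patch, j:j+patch] = predict_patch(image[i:i+patch, j:j+patch])
    for i in range(patch, H, patch):            # horizontal boundary bands
        hole[i-band:i+band, :] = True
    for j in range(patch, W, patch):            # vertical boundary bands
        hole[:, j-band:j+band] = True
    return seg, hole                            # `hole` becomes the inpainting mask

img = np.random.rand(256, 256, 3)
seg, hole = patch_and_stitch(img)
print(seg.shape, hole.mean())                   # bands are re-predicted via inpainting
```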
https://arxiv.org/abs/2404.10307
Generating background scenes for salient objects plays a crucial role across various domains including creative design and e-commerce, as it enhances the presentation and context of subjects by integrating them into tailored environments. Background generation can be framed as a task of text-conditioned outpainting, where the goal is to extend image content beyond a salient object's boundaries on a blank background. Although popular diffusion models for text-guided inpainting can also be used for outpainting by mask inversion, they are trained to fill in missing parts of an image rather than to place an object into a scene. Consequently, when used for background creation, inpainting models frequently extend the salient object's boundaries and thereby change the object's identity, which is a phenomenon we call "object expansion." This paper introduces a model for adapting inpainting diffusion models to the salient object outpainting task using Stable Diffusion and ControlNet architectures. We present a series of qualitative and quantitative results across models and datasets, including a newly proposed metric to measure object expansion that does not require any human labeling. Compared to Stable Diffusion 2.0 Inpainting, our proposed approach reduces object expansion by 3.6x on average with no degradation in standard visual metrics across multiple datasets.
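One plausible way to quantify object expansion without human labels (the paper's proposed metric may differ in its details): re-segment the salient object in the generated result and measure how far it grew beyond the input foreground mask.

```python
import numpy as np

def object_expansion(mask_before, mask_after):
    """Boolean (H, W) masks: the input salient-object mask and a
    salient-object segmentation of the generated result."""
    grown = mask_after & ~mask_before            # newly claimed object pixels
    return grown.sum() / max(mask_before.sum(), 1)

before = np.zeros((128, 128), bool); before[40:80, 40:80] = True
after = np.zeros((128, 128), bool); after[40:90, 40:90] = True   # object expanded
print(f"expansion ratio: {object_expansion(before, after):.2f}")
```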
https://arxiv.org/abs/2404.10157
Neural Radiance Field (NeRF) is a representation for 3D reconstruction from multi-view images. Despite some recent work showing preliminary success in editing a reconstructed NeRF with a diffusion prior, these methods still struggle to synthesize reasonable geometry in completely uncovered regions. One major reason is the high diversity of synthetic contents from the diffusion model, which hinders the radiance field from converging to a crisp and deterministic geometry. Moreover, applying latent diffusion models to real data often yields a textural shift incoherent with the image condition due to auto-encoding errors. These two problems are further reinforced by the use of pixel-distance losses. To address these issues, we propose tempering the diffusion model's stochasticity with per-scene customization and mitigating the textural shift with masked adversarial training. During our analyses, we also found that the commonly used pixel and perceptual losses are harmful in the NeRF inpainting task. Through rigorous experiments, our framework yields state-of-the-art NeRF inpainting results on various real-world scenes. Project page: this https URL
https://arxiv.org/abs/2404.09995
Accurate completion and denoising of roof height maps are crucial to reconstructing high-quality 3D buildings. Repairing sparse points can enhance low-cost sensor use and reduce UAV flight overlap. RoofDiffusion is a new end-to-end self-supervised diffusion technique for robustly completing roof height maps, including particularly difficult ones. RoofDiffusion leverages widely available curated footprints and can thus handle up to 99% point sparsity and 80% roof area occlusion (regional incompleteness). A variant, No-FP RoofDiffusion, simultaneously predicts building footprints and heights. Both quantitatively outperform state-of-the-art unguided depth completion and representative inpainting methods for Digital Elevation Models (DEM), on both a roof-specific benchmark and the BuildingNet dataset. Qualitative assessments show the effectiveness of RoofDiffusion on datasets with real-world scans, including AHN3, Dales3D, and USGS 3DEP LiDAR. Tested with the leading City3D algorithm, preprocessing height maps with RoofDiffusion noticeably improves 3D building reconstruction. RoofDiffusion is complemented by a new dataset of 13k complex roof geometries, focusing on long-tail issues in remote sensing; a novel simulation of tree occlusion; and a wide variety of large-area roof cut-outs for data augmentation and benchmarking.
https://arxiv.org/abs/2404.09290
We introduce RealmDreamer, a technique for the generation of general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model, conditioning it on the samples from the inpainting model, which provides rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
https://arxiv.org/abs/2404.07199
Fractional Brownian trajectories (fBm) feature both randomness and strong scale-free correlations, challenging generative models to reproduce the intrinsic memory characterizing the underlying process. Here we test a diffusion probabilistic model on a specific dataset of corrupted images corresponding to incomplete Euclidean distance matrices of fBm at various memory exponents $H$. Our dataset implies uniqueness of the data imputation in the regime of low missing ratio, where the remaining partial graph is rigid, providing the ground truth for the inpainting. We find that the conditional diffusion generation stably reproduces the statistics of missing fBm-distributed distances for different values of $H$ exponent. Furthermore, while diffusion models have been recently shown to remember samples from the training database, we show that diffusion-based inpainting behaves qualitatively different from the database search with the increasing database size. Finally, we apply our fBm-trained diffusion model with $H=1/3$ for completion of chromosome distance matrices obtained in single-cell microscopy experiments, showing its superiority over the standard bioinformatics algorithms. Our source code is available on GitHub at this https URL.
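Conditional diffusion inpainting of a distance matrix can be sketched in the RePaint style: at each reverse step, the observed entries are overwritten with a correspondingly noised copy of the data, so generation only fills the missing entries. The denoiser and noise schedule below are stand-ins, not the paper's trained model.

```python
import torch
import torch.nn as nn

denoiser = nn.Conv2d(1, 1, 3, padding=1)          # stand-in score network
T = 200

def alpha_bar(t):                                 # toy cosine-like noise schedule
    return torch.cos(torch.tensor(t / T) * 1.4) ** 2

@torch.no_grad()
def inpaint(known, mask, steps=T):
    """known: (1,1,N,N) observed distance matrix; mask: 1 where observed."""
    x = torch.randn_like(known)
    for t in reversed(range(1, steps)):
        ab, ab_prev = alpha_bar(t), alpha_bar(t - 1)
        eps = denoiser(x)
        x0 = (x - (1 - ab).sqrt() * eps) / ab.sqrt()         # estimate clean matrix
        x = ab_prev.sqrt() * x0 + (1 - ab_prev).sqrt() * torch.randn_like(x)
        x_known = ab_prev.sqrt() * known + (1 - ab_prev).sqrt() * torch.randn_like(known)
        x = mask * x_known + (1 - mask) * x                  # keep observed entries
    return x

D = torch.rand(1, 1, 32, 32); D = (D + D.transpose(-1, -2)) / 2  # symmetric distances
mask = (torch.rand_like(D) > 0.1).float()                        # low missing ratio
completed = inpaint(D, mask)                                     # only holes are generated
print(completed.shape)
```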
https://arxiv.org/abs/2404.07029
Diffusion models have shown remarkable results for image generation, editing and inpainting. Recent works explore diffusion models for 3D shape generation with neural implicit functions, i.e., signed distance functions and occupancy functions. However, they are limited to shapes with closed surfaces, which prevents them from generating diverse 3D real-world contents containing open surfaces. In this work, we present UDiFF, a 3D diffusion model for unsigned distance fields (UDFs) which is capable of generating textured 3D shapes with open surfaces from text conditions or unconditionally. Our key idea is to generate UDFs in the spatial-frequency domain with an optimal wavelet transformation, which produces a compact representation space for UDF generation. Specifically, instead of selecting an appropriate wavelet transformation by hand, which requires expensive manual effort and still leads to large information loss, we propose a data-driven approach to learn the optimal wavelet transformation for UDFs. We evaluate UDiFF and show our advantages through numerical and visual comparisons with the latest methods on widely used benchmarks. Page: this https URL.
https://arxiv.org/abs/2404.06851
We propose ZeST, a method for zero-shot material transfer to an object in an input image given a material exemplar image. ZeST leverages existing diffusion adapters to extract an implicit material representation from the exemplar image. This representation is used to transfer the material onto the object in the input image using a pre-trained inpainting diffusion model, with depth estimates as a geometry cue and grayscale object shading as an illumination cue. The method works on real images without any training, resulting in a zero-shot approach. Both qualitative and quantitative results on real and synthetic datasets demonstrate that ZeST outputs photorealistic images with transferred materials. We also show the application of ZeST to perform multiple edits and robust material assignment under different illuminations. Project Page: this https URL
https://arxiv.org/abs/2404.06425
In Generalized Category Discovery (GCD), we cluster unlabeled samples of known and novel classes, leveraging a training dataset of known classes. A salient challenge arises due to domain shifts between these datasets. To address this, we present a novel setting, Across Domain Generalized Category Discovery (AD-GCD), and bring forth CDAD-NET (Class Discoverer Across Domains) as a remedy. CDAD-NET is architected to synchronize potential known-class samples across both the labeled (source) and unlabeled (target) datasets, while emphasizing the distinct categorization of the target data. To facilitate this, we propose an entropy-driven adversarial learning strategy that accounts for the distance distributions of target samples relative to source-domain class prototypes. In parallel, the discriminative nature of the shared space is upheld through a fusion of three metric learning objectives. In the source domain, our focus is on refining the proximity between samples and their affiliated class prototypes, while in the target domain, we integrate a neighborhood-centric contrastive learning mechanism, enriched with an adept neighbor-mining approach. To further accentuate the nuanced feature interrelations among semantically aligned images, we champion the concept of conditional image inpainting, underscoring the premise that semantically analogous images prove more efficacious to the task than their disjointed counterparts. Experimentally, CDAD-NET eclipses the existing literature with a performance increment of 8-15% on the three AD-GCD benchmarks we present.
https://arxiv.org/abs/2404.05366