We introduce a deep-learning method based on denoising diffusion models that translates low-resolution images into high-resolution images from a different optical sensor, while preserving the content and avoiding undesired artifacts. The proposed method is trained and tested on a large and diverse dataset of paired Sentinel-2 and Planet Dove images. We show that it can solve serious image generation issues observed when the popular classifier-free guided Denoising Diffusion Implicit Model (DDIM) framework is used for image-to-image translation of multi-sensor optical remote sensing images, and that it can generate large images with highly consistent patches, both in colors and in features. Moreover, we demonstrate how our method improves heterogeneous change detection results in two urban areas: Beirut, Lebanon, and Austin, USA. Our contributions are: i) a new training and testing algorithm based on denoising diffusion models for optical image translation; ii) a comprehensive image quality evaluation and ablation study; iii) a comparison with the classifier-free guided DDIM framework; and iv) change detection experiments on heterogeneous data.
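As a rough illustration of the machinery such a translator builds on, the sketch below implements a minimal deterministic DDIM sampling loop (eta = 0) conditioned on a low-resolution input; the noise predictor and the alpha-bar schedule are toy stand-ins, not the authors' trained model:

```python
import numpy as np

def toy_eps_model(x_t, t, lowres):
    """Stand-in for the trained conditional noise predictor."""
    return x_t - lowres

def ddim_translate(lowres, steps=50, seed=0):
    """Minimal deterministic DDIM sampling loop (eta = 0)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(lowres.shape)
    alpha_bar = np.linspace(0.9999, 0.0001, steps)  # toy noise schedule
    for i in range(steps - 1):
        a_t, a_prev = alpha_bar[i], alpha_bar[i + 1]
        eps = toy_eps_model(x, i, lowres)
        x0 = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)      # predicted clean image
        x = np.sqrt(a_prev) * x0 + np.sqrt(1 - a_prev) * eps  # DDIM update
    return x

hires = ddim_translate(np.zeros((64, 64, 3)))
```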
https://arxiv.org/abs/2404.11243
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions. Code will be released upon publication of this work at this https URL.
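A minimal sketch of the two-stage recipe described above, with linear layers standing in for the text-to-motion diffusion model and the scene encoder (all shapes, losses, and learning rates are assumptions):

```python
import torch
import torch.nn as nn

denoiser = nn.Linear(64, 64)      # stage-1 scene-agnostic denoiser (stub)
scene_branch = nn.Linear(16, 64)  # scene-aware component: ground plane + objects (stub)

# Stage 1: pre-train the scene-agnostic model on motion-capture data.
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
for _ in range(10):
    noisy_motion = torch.randn(8, 64)
    loss = denoiser(noisy_motion).pow(2).mean()  # stand-in denoising/goal loss
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: freeze the base model, fine-tune only the scene-aware component
# on data augmented with detailed scene information.
for p in denoiser.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.Adam(scene_branch.parameters(), lr=1e-5)
for _ in range(10):
    noisy_motion, scene = torch.randn(8, 64), torch.randn(8, 16)
    loss = denoiser(noisy_motion + scene_branch(scene)).pow(2).mean()
    opt2.zero_grad(); loss.backward(); opt2.step()
```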
https://arxiv.org/abs/2404.10685
Images captured from the real world are often affected by different types of noise, which can significantly impact the performance of computer vision systems and the quality of visual data. This study presents a novel approach for defect detection in noisy images of casting products, specifically submersible pump impellers. The methodology uses deep learning models such as VGG16 and InceptionV3, in both the spatial and frequency domains, to identify noise types and defect status. The research process begins with preprocessing the images, followed by denoising techniques tailored to specific noise categories. The goal is to enhance the accuracy and robustness of defect detection by integrating noise detection and denoising into the classification pipeline. The study achieved remarkable results using VGG16 for noise type classification in the frequency domain, with an accuracy of over 99%. Removal of salt-and-pepper noise resulted in an average SSIM of 87.9, Gaussian noise removal in an average SSIM of 64.0, and periodic noise removal in an average SSIM of 81.6. This comprehensive approach showcases the effectiveness of the deep autoencoder model and the median filter as denoising strategies in real-world industrial applications. Finally, our study reports significant improvements in binary classification accuracy for defect detection compared to previous methods: for the VGG16 classifier, accuracy increased from 94.6% to 97.0%, demonstrating the effectiveness of the proposed noise detection and denoising approach, and for the InceptionV3 classifier, accuracy improved from 84.7% to 90.0%, further validating the benefits of integrating noise analysis into the classification pipeline.
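The classify-then-dispatch pipeline can be sketched as below; the frequency-domain classifier, the notch filter, and the autoencoder are stubs standing in for the trained models, while the median-filter branch uses the real SciPy call:

```python
import numpy as np
from scipy.ndimage import median_filter

def classify_noise(image):
    """Stand-in for the frequency-domain VGG16 noise-type classifier."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image)))  # inspected by the real model
    return "salt_pepper"

def notch_filter(image):
    """Stub for a periodic-noise notch filter (hypothetical helper)."""
    return image

def autoencoder_denoise(image):
    """Stub standing in for the trained deep autoencoder (Gaussian noise)."""
    return image

def denoise(image):
    noise_type = classify_noise(image)
    if noise_type == "salt_pepper":
        return median_filter(image, size=3)  # impulse noise: median filter
    if noise_type == "periodic":
        return notch_filter(image)
    return autoencoder_denoise(image)

clean = denoise(np.random.rand(128, 128))
```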
https://arxiv.org/abs/2404.10664
Omnidirectional images (ODIs) are commonly used in real-world visual tasks, and high-resolution ODIs help improve the performance of related visual tasks. Most existing super-resolution methods for ODIs use end-to-end learning strategies, which yield limited realness in the generated images and lack effective out-of-domain generalization. Image generation methods, exemplified by diffusion models, provide strong priors for visual tasks and have proven effective for image restoration. Leveraging the image priors of the Stable Diffusion (SD) model, we achieve omnidirectional image super-resolution with both fidelity and realness, dubbed OmniSSR. First, we transform equirectangular projection (ERP) images into tangent projection (TP) images, whose distribution approximates the planar image domain. We then use SD to iteratively sample initial high-resolution results. At each denoising iteration, we correct and update the initial results using the proposed Octadecaplex Tangent Information Interaction (OTII) and Gradient Decomposition (GD) techniques to ensure better consistency. Finally, the TP images are transformed back to obtain the final high-resolution results. Our method is zero-shot, requiring no training or fine-tuning. Experiments on two benchmark datasets demonstrate the effectiveness of the proposed method.
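A skeleton of the zero-shot loop, under the assumption that "Octadecaplex" refers to 18 tangent-plane views; every component below is a stub for the corresponding module in the paper:

```python
import numpy as np

# Stubs standing in for the paper's components (names and shapes are assumptions).
def erp_to_tp(erp):          return [erp] * 18  # 18 tangent-plane views
def tp_to_erp(tps):          return tps[0]
def sd_denoise_step(tp, t):  return tp          # one Stable Diffusion step
def otii_gd_correct(tps, lowres):  return tps   # OTII + GD consistency correction

def omnissr(lowres_erp, steps=20):
    tps = erp_to_tp(lowres_erp)
    for t in reversed(range(steps)):
        tps = [sd_denoise_step(tp, t) for tp in tps]  # prior-driven sampling
        tps = otii_gd_correct(tps, lowres_erp)        # enforce fidelity
    return tp_to_erp(tps)

hr = omnissr(np.zeros((256, 512, 3)))
```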
https://arxiv.org/abs/2404.10312
Text-to-image diffusion models benefit artists with high-quality image generation. Yet their stochastic nature prevents artists from creating consistent images of the same character. Existing methods tackle this challenge and generate consistent content in various ways, but they either depend on external data or require expensive tuning of the diffusion model. We argue instead that lightweight but intricate guidance is enough. To this end, we formalize the objective of consistent generation, derive a clustering-based score function, and propose a novel paradigm, OneActor. We design a cluster-conditioned model that incorporates posterior samples to guide the denoising trajectories toward the target cluster. To overcome the overfitting challenge shared by one-shot tuning pipelines, we devise auxiliary components that simultaneously augment the tuning and regulate the inference; this technique is verified to significantly enhance the content diversity of generated images. Comprehensive experiments show that our method outperforms a variety of baselines with satisfactory character consistency, superior prompt conformity, and high image quality, while being at least 4 times faster than tuning-based baselines. Furthermore, to the best of our knowledge, we are the first to show that the semantic space has the same interpolation property as the latent space does. This property can serve as another promising tool for fine generation control.
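One plausible form of such cluster-conditioned guidance, written in the style of classifier-free guidance; the decomposition and weights are our assumptions, not the paper's derived score function:

```python
import torch

def guided_eps(eps_uncond, eps_text, eps_cluster, w_text=7.5, w_cluster=2.0):
    """Steer the denoising direction toward the target character cluster on
    top of ordinary text guidance (weights and decomposition are assumed)."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_cluster * (eps_cluster - eps_text))

eps = guided_eps(torch.zeros(4, 64, 64), torch.ones(4, 64, 64), torch.ones(4, 64, 64))
```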
https://arxiv.org/abs/2404.10267
Recently, diffusion-based depth estimation methods have drawn widespread attention due to their elegant denoising patterns and promising performance. However, they are typically unreliable under adverse conditions prevalent in real-world scenarios, such as rain and snow. In this paper, we propose a novel robust depth estimation method called D4RD, featuring a custom contrastive learning mode tailored to diffusion models that mitigates performance degradation in complex environments. Concretely, we integrate the strength of knowledge distillation into contrastive learning, building a "trinity" contrastive scheme. This scheme uses the sampled noise of the forward diffusion process as a natural reference, guiding the predicted noise in diverse scenes toward a more stable and precise optimum. Moreover, we extend the noise-level trinity to the more generic feature and image levels, establishing a multi-level contrast that distributes the burden of robust perception across the overall network. Before addressing complex scenarios, we enhance the stability of the baseline diffusion model with three straightforward yet effective improvements, which facilitate convergence and remove depth outliers. Extensive experiments demonstrate that D4RD surpasses existing state-of-the-art solutions on synthetic corruption datasets and in real-world weather conditions. The code for D4RD will be made available for further exploration and adoption.
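A sketch of the noise-level "trinity" idea: the noise sampled in the forward process serves as the shared reference for predictions from both clean and adverse versions of a scene (the exact weighting and distillation direction are assumptions):

```python
import torch
import torch.nn.functional as F

def trinity_loss(eps_true, eps_pred_clean, eps_pred_adverse):
    """Both predictions are pulled toward the forward-process noise, and the
    adverse-scene prediction is additionally distilled toward the clean one."""
    l_clean = F.mse_loss(eps_pred_clean, eps_true)
    l_adverse = F.mse_loss(eps_pred_adverse, eps_true)
    l_align = F.mse_loss(eps_pred_adverse, eps_pred_clean.detach())  # distillation term
    return l_clean + l_adverse + l_align

loss = trinity_loss(torch.randn(2, 1, 32, 32),
                    torch.randn(2, 1, 32, 32),
                    torch.randn(2, 1, 32, 32))
```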
https://arxiv.org/abs/2404.09831
Effectively discerning spatial-spectral dependencies is crucial in hyperspectral image (HSI) denoising, but prevailing methods based on convolutions or transformers still face computational-efficiency limitations. Recently, the emerging Selective State Space Model (Mamba) has gained attention for its nearly linear computational complexity in processing natural language sequences, which inspired us to explore its potential for handling long spectral sequences. In this paper, we propose HSIDMamba (HSDM), tailored to exploit this linear complexity to effectively capture spatial-spectral dependencies in HSI denoising. In particular, HSDM comprises multiple Hyperspectral Continuous Scan Blocks that incorporate a Bidirectional Continuous Scanning Mechanism (BCSM), scale residuals, and spectral attention to enhance the capture of long-range and local spatial-spectral information. BCSM strengthens spatial-spectral interactions by linking forward and backward scans and aggregating information from eight directions through the SSM, significantly enhancing the perceptual capability of HSDM and improving denoising performance. Extensive evaluations against HSI denoising benchmarks validate the superior performance of HSDM, which achieves state-of-the-art results while surpassing the efficiency of the latest transformer architectures by 30%.
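One plausible reading of the eight-directional scanning is flattening the feature map along rows, columns, and both diagonals, each forward and backward; the sketch below builds those eight sequences (the true BCSM operates inside the SSM blocks):

```python
import numpy as np

def eight_direction_scans(x):
    """Flatten an H x W feature map along eight scan paths (rows, columns,
    and both diagonals, forward and backward) - an assumed reading of BCSM."""
    h, w = x.shape
    rows = x.reshape(-1)
    cols = x.T.reshape(-1)
    diag = np.concatenate([x.diagonal(k) for k in range(-h + 1, w)])
    anti = np.concatenate([np.fliplr(x).diagonal(k) for k in range(-h + 1, w)])
    return [s for base in (rows, cols, diag, anti) for s in (base, base[::-1])]

scans = eight_direction_scans(np.arange(16.0).reshape(4, 4))
assert len(scans) == 8
```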
https://arxiv.org/abs/2404.09697
We propose In-Context Translation (ICT), a general learning framework to unify visual recognition (e.g., semantic segmentation), low-level image processing (e.g., denoising), and conditional image generation (e.g., edge-to-image synthesis). Thanks to unification, ICT significantly reduces the inherent inductive bias that comes with designing models for specific tasks, and it maximizes mutual enhancement across similar tasks. However, the unification across a large number of tasks is non-trivial due to various data formats and training pipelines. To this end, ICT introduces two designs. Firstly, it standardizes input-output data of different tasks into RGB image pairs, e.g., semantic segmentation data pairs an RGB image with its segmentation mask in the same RGB format. This turns different tasks into a general translation task between two RGB images. Secondly, it standardizes the training of different tasks into a general in-context learning, where "in-context" means the input comprises an example input-output pair of the target task and a query image. The learning objective is to generate the "missing" data paired with the query. The implicit translation process is thus between the query and the generated image. In experiments, ICT unifies ten vision tasks and showcases impressive performance on their respective benchmarks. Notably, compared to its competitors, e.g., Painter and PromptDiffusion, ICT trained on only 4 RTX 3090 GPUs is shown to be more efficient and less costly in training.
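The input-output standardization can be pictured as assembling one canvas per task; the quadrant layout below is an assumption for illustration:

```python
import numpy as np

def make_in_context_canvas(example_in, example_out, query):
    """Standardize a task into RGB image pairs on one canvas: the example
    input/output pair on top, the query next to a blank quadrant that the
    model must fill with the 'missing' output."""
    blank = np.zeros_like(query)
    top = np.concatenate([example_in, example_out], axis=1)
    bottom = np.concatenate([query, blank], axis=1)
    return np.concatenate([top, bottom], axis=0)

img = np.ones((64, 64, 3), dtype=np.float32)
canvas = make_in_context_canvas(img, img * 0.5, img * 0.2)  # 128 x 128 x 3
```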
https://arxiv.org/abs/2404.09633
Low-dose computed tomography (LDCT) has become the technology of choice for diagnostic medical imaging because of its lower radiation dose compared to standard CT, despite increased image noise that can affect diagnostic accuracy. To address this, advanced deep-learning-based LDCT denoising algorithms have been developed, primarily using Convolutional Neural Networks (CNNs) or Transformer networks with the Unet architecture, which enhances image detail by integrating feature maps from the encoder and decoder via skip connections. However, current methods often overlook enhancements to the Unet architecture itself, focusing instead on optimizing the encoder and decoder structures. This can be problematic because of the significant differences in feature-map characteristics between the encoder and decoder, where simple fusion strategies may not reconstruct images effectively. In this paper, we introduce WiTUnet, a novel LDCT image denoising method that uses nested, dense skip pathways instead of traditional skip connections to improve feature integration. WiTUnet also incorporates a windowed Transformer structure that processes images in smaller, non-overlapping segments, reducing the computational load. Additionally, a Local Image Perception Enhancement (LiPe) module in both the encoder and decoder replaces the standard multi-layer perceptron (MLP) in the Transformer, enhancing local feature capture and representation. In extensive experimental comparisons, WiTUnet demonstrates superior performance over existing methods on key metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Root Mean Square Error (RMSE), significantly improving noise removal and image quality.
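A toy version of one nested (UNet++-style) skip node, which fuses all same-level predecessors with the upsampled node from the level below instead of relying on a single long skip connection; channel sizes are placeholders:

```python
import torch
import torch.nn as nn

class DenseSkipNode(nn.Module):
    """One node of a nested, dense skip pathway: concatenate all same-level
    predecessors with the upsampled feature from the level below, then fuse."""
    def __init__(self, ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.LazyConv2d(ch, kernel_size=3, padding=1)

    def forward(self, same_level_feats, below_feat):
        x = torch.cat(same_level_feats + [self.up(below_feat)], dim=1)
        return torch.relu(self.fuse(x))

node = DenseSkipNode(32)
out = node([torch.randn(1, 32, 16, 16)], torch.randn(1, 64, 8, 8))
```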
https://arxiv.org/abs/2404.09533
Diffusion models have emerged as preeminent contenders in the realm of generative models. Distinguished by their sequential generative processes of hundreds or even thousands of timesteps, diffusion models progressively reconstruct images from pure Gaussian noise, with each timestep requiring full inference of the entire model. The substantial computational demands of these models present challenges for deployment, so quantization is widely used to lower the bit-width and reduce storage and computing overheads. Current quantization methodologies primarily focus on model-side optimization and disregard the temporal dimension, such as the length of the timestep sequence, thereby allowing redundant timesteps to continue consuming computational resources and leaving substantial scope for accelerating the generative process. In this paper, we introduce TMPQ-DM, which jointly optimizes timestep reduction and quantization to achieve a superior performance-efficiency trade-off, addressing both temporal and model optimization. For timestep reduction, we devise a non-uniform grouping scheme tailored to the non-uniform nature of the denoising process, mitigating the combinatorial explosion of timestep choices. For quantization, we adopt a fine-grained layer-wise approach that allocates varying bit-widths to different layers based on their respective contributions to final generative performance, rectifying the performance degradation observed in prior studies. To expedite the evaluation of fine-grained quantization, we further devise a super-network that serves as a precision solver by leveraging shared quantization results. These two design components are seamlessly integrated within our framework, enabling rapid joint exploration of the exponentially large decision space via a gradient-free evolutionary search algorithm.
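A bare-bones version of the gradient-free joint search over timestep subsets and per-layer bit-widths; the fitness function below is a toy proxy for the super-network precision solver:

```python
import random

def evaluate(timesteps, bitwidths):
    """Toy fitness standing in for the super-network precision solver:
    a quality proxy minus a compute-cost penalty."""
    quality = len(timesteps) + sum(bitwidths) / len(bitwidths)
    cost = len(timesteps) * sum(bitwidths)
    return quality - 1e-3 * cost

def mutate(cfg, all_steps):
    steps, bits = cfg
    steps = sorted(random.sample(all_steps, k=len(steps)))  # re-pick timesteps
    bits = [random.choice([2, 4, 8]) if random.random() < 0.2 else b for b in bits]
    return steps, bits

def evolve(all_steps=range(1000), n_layers=20, pop=8, iters=30):
    steps_list = list(all_steps)
    population = [(sorted(random.sample(steps_list, 10)), [8] * n_layers)
                  for _ in range(pop)]
    for _ in range(iters):
        population.sort(key=lambda c: evaluate(*c), reverse=True)
        population = population[: pop // 2]                      # keep the fittest
        population += [mutate(c, steps_list) for c in population]  # offspring
    return max(population, key=lambda c: evaluate(*c))

best_steps, best_bits = evolve()
```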
https://arxiv.org/abs/2404.09532
We introduce a novel approach to single-image denoising based on the Blind Spot Denoising principle, which we call MAsked and SHuffled Blind Spot Denoising (MASH). We focus on the case of correlated noise, which often plagues real images. MASH is the result of a careful analysis of the relationship between the level of blindness (masking) of the input and the (unknown) noise correlation. Moreover, we introduce a shuffling technique that weakens the local correlation of noise, which yields an additional denoising performance improvement. We evaluate MASH via extensive experiments on real-world noisy image datasets and demonstrate results on par with or better than existing self-supervised denoising methods.
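The two ingredients, masking and local shuffling, can be sketched as follows; the tile size, mask ratio, and fill value are assumptions:

```python
import numpy as np

def mask_and_shuffle(img, mask_ratio=0.5, tile=4, seed=0):
    """(i) Shuffle pixels inside small tiles to weaken local noise correlation;
    (ii) mask a fraction of pixels (the 'level of blindness')."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    out = img.copy()
    for y in range(0, h - tile + 1, tile):      # local pixel shuffling
        for x in range(0, w - tile + 1, tile):
            block = out[y:y + tile, x:x + tile].reshape(-1)
            rng.shuffle(block)
            out[y:y + tile, x:x + tile] = block.reshape(tile, tile)
    mask = rng.random((h, w)) < mask_ratio      # blind-spot masking
    out[mask] = 0.0
    return out, mask

masked, mask = mask_and_shuffle(np.random.rand(32, 32))
```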
https://arxiv.org/abs/2404.09389
Accurate completion and denoising of roof height maps are crucial to reconstructing high-quality 3D buildings. Repairing sparse points can enhance low-cost sensor use and reduce UAV flight overlap. RoofDiffusion is a new end-to-end self-supervised diffusion technique for robustly completing even particularly difficult roof height maps. RoofDiffusion leverages widely available curated footprints and can thus handle up to 99% point sparsity and 80% roof area occlusion (regional incompleteness). A variant, No-FP RoofDiffusion, simultaneously predicts building footprints and heights. Both quantitatively outperform state-of-the-art unguided depth completion and representative inpainting methods for Digital Elevation Models (DEMs), on both a roof-specific benchmark and the BuildingNet dataset. Qualitative assessments show the effectiveness of RoofDiffusion on datasets with real-world scans, including AHN3, Dales3D, and USGS 3DEP LiDAR. Tested with the leading City3D algorithm, preprocessing height maps with RoofDiffusion noticeably improves 3D building reconstruction. RoofDiffusion is complemented by a new dataset of 13k complex roof geometries focusing on long-tail issues in remote sensing, a novel simulation of tree occlusion, and a wide variety of large-area roof cut-outs for data augmentation and benchmarking.
https://arxiv.org/abs/2404.09290
This paper presents a proof-of-concept approach for learned synergistic reconstruction of medical images using multi-branch generative models. Leveraging variational autoencoders (VAEs) and generative adversarial networks (GANs), our models learn from pairs of images simultaneously, enabling effective denoising and reconstruction. Synergistic image reconstruction is achieved by incorporating the trained models in a regularizer that evaluates the distance between the images and the model, in a similar fashion to multichannel dictionary learning (DiL). We demonstrate the efficacy of our approach on both Modified National Institute of Standards and Technology (MNIST) and positron emission tomography (PET)/computed tomography (CT) datasets, showcasing improved image quality and information sharing between modalities. Despite challenges such as patch decomposition and model limitations, our results underscore the potential of generative models for enhancing medical imaging reconstruction.
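A compact sketch of reconstruction with a learned-model regularizer: gradient steps on a data-fidelity term plus the distance between the current image pair and its projection through the trained generative model (`autoencode` below is a stand-in for the multi-branch VAE/GAN):

```python
import torch

def synergistic_reconstruct(y1, y2, autoencode, lam=0.1, iters=100, lr=0.1):
    """Minimize ||x1 - y1||^2 + ||x2 - y2||^2 + lam * dist(x, model(x))."""
    x1, x2 = y1.clone().requires_grad_(), y2.clone().requires_grad_()
    opt = torch.optim.Adam([x1, x2], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        p1, p2 = autoencode(x1, x2)  # projection through the generative model
        loss = ((x1 - y1) ** 2).mean() + ((x2 - y2) ** 2).mean() \
             + lam * (((x1 - p1) ** 2).mean() + ((x2 - p2) ** 2).mean())
        loss.backward()
        opt.step()
    return x1.detach(), x2.detach()

x1, x2 = synergistic_reconstruct(torch.randn(1, 1, 28, 28), torch.randn(1, 1, 28, 28),
                                 autoencode=lambda a, b: (a, b))  # identity stub
```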
https://arxiv.org/abs/2404.08748
Incorporating diffusion models in the image compression domain has the potential to produce realistic and detailed reconstructions, especially at extremely low bitrates. Previous methods focus on using diffusion models as expressive decoders robust to quantization errors in the conditioning signals, yet achieving competitive results in this manner requires costly training of the diffusion model and long inference times due to the iterative generative process. In this work we formulate the removal of quantization error as a denoising task, using diffusion to recover lost information in the transmitted image latent. Our approach allows us to perform less than 10\% of the full diffusion generative process and requires no architectural changes to the diffusion model, enabling the use of foundation models as a strong prior without additional fine tuning of the backbone. Our proposed codec outperforms previous methods in quantitative realism metrics, and we verify that our reconstructions are qualitatively preferred by end users, even when other methods use twice the bitrate.
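The core idea can be sketched as running only the tail of the reverse process, treating the quantized latent as if it were a sample at a small timestep t* (here 80 of 1000, i.e., under 10% of the schedule); the schedule and noise predictor are toy stand-ins for the frozen foundation model:

```python
import torch

def denoise_quantized_latent(z_quant, eps_model, alphas_bar, t_star=80, t_end=0):
    """Treat quantization error as noise at timestep t_star, then run only
    the last reverse-diffusion steps (DDIM, eta = 0)."""
    z = z_quant
    for t in range(t_star, t_end, -1):
        a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]
        eps = eps_model(z, t)
        z0 = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        z = a_prev.sqrt() * z0 + (1 - a_prev).sqrt() * eps  # DDIM step
    return z

alphas_bar = torch.linspace(0.9999, 0.1, 1000)  # toy schedule, alpha_bar[0] ~ 1
z = denoise_quantized_latent(torch.randn(1, 4, 32, 32),
                             eps_model=lambda z, t: torch.zeros_like(z),
                             alphas_bar=alphas_bar)
```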
https://arxiv.org/abs/2404.08580
Despite significant progress in image denoising, it remains challenging to restore fine-scale details while removing noise, especially in extremely low-light environments. Leveraging near-infrared (NIR) images to assist visible RGB image denoising shows the potential to address this issue, making it a promising technology. Nonetheless, existing works still struggle to take advantage of NIR information effectively for real-world image denoising, due to the content inconsistency between NIR-RGB images and the scarcity of real-world paired datasets. To alleviate the problem, we propose an efficient Selective Fusion Module (SFM) that can be plugged into advanced denoising networks in a plug-and-play manner to merge deep NIR-RGB features. Specifically, we sequentially perform global and local modulation of the NIR and RGB features and then integrate the two modulated features. Furthermore, we present a Real-world NIR-Assisted Image Denoising (Real-NAID) dataset, which covers diverse scenarios as well as various noise levels. Extensive experiments on both synthetic and our real-world datasets demonstrate that the proposed method achieves better results than state-of-the-art ones. The dataset, code, and pre-trained models will be publicly available at this https URL.
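A sketch of what such a plug-and-play fusion block might look like, with global (channel-wise) and local (spatial) modulation followed by a 1x1 merge; the exact layer shapes and modulation order are assumptions:

```python
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    """Modulate NIR features globally then locally using RGB features,
    and merge the two streams with a 1x1 convolution."""
    def __init__(self, ch):
        super().__init__()
        self.global_gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                         nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.local_gate = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())
        self.merge = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, nir, rgb):
        nir = nir * self.global_gate(rgb)  # global (channel-wise) modulation
        nir = nir * self.local_gate(rgb)   # local (spatial) modulation
        return self.merge(torch.cat([nir, rgb], dim=1))

fused = SelectiveFusion(32)(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64))
```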
https://arxiv.org/abs/2404.08514
Generative models, e.g., Stable Diffusion, have enabled the creation of photorealistic images from text prompts. Yet, the generation of 360-degree panorama images from text remains a challenge, particularly due to the dearth of paired text-panorama data and the domain gap between panorama and perspective images. In this paper, we introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt. We leverage the stable diffusion model as one branch to provide prior knowledge in natural image generation and register it to another panorama branch for holistic image generation. We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process. Our experiments validate that PanFusion surpasses existing methods and, thanks to its dual-branch structure, can integrate additional constraints like room layout for customized panorama outputs. Code is available at this https URL.
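A sketch of one dual-branch block in which the panorama and perspective branches exchange information via cross-attention; the projection-aware correspondence is reduced here to plain cross-attention over flattened tokens, which is an assumption:

```python
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    """Panorama branch attends to perspective tokens and vice versa,
    with residual connections around each exchange."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.pano_from_persp = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.persp_from_pano = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pano_tokens, persp_tokens):
        pano, _ = self.pano_from_persp(pano_tokens, persp_tokens, persp_tokens)
        persp, _ = self.persp_from_pano(persp_tokens, pano_tokens, pano_tokens)
        return pano_tokens + pano, persp_tokens + persp

pano, persp = DualBranchBlock()(torch.randn(1, 128, 64), torch.randn(1, 256, 64))
```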
https://arxiv.org/abs/2404.07949
Blind-spot networks (BSNs) have been prevalent network architectures in self-supervised image denoising (SSID). Existing BSNs are mostly built with convolution layers. Although transformers offer potential solutions to the limitations of convolutions and have demonstrated success in various image restoration tasks, their attention mechanisms may violate the blind-spot requirement, restricting their applicability to SSID. In this paper, we present a transformer-based blind-spot network (TBSN) by analyzing and redesigning transformer operators to meet the blind-spot requirement. Specifically, TBSN follows the architectural principles of dilated BSNs and incorporates spatial as well as channel self-attention layers to enhance network capability. For spatial self-attention, an elaborate mask is applied to the attention matrix to restrict its receptive field, thus mimicking dilated convolution. For channel self-attention, we observe that it may leak blind-spot information when the channel number is greater than the spatial size in the deep layers of multi-scale architectures. To eliminate this effect, we divide the channels into several groups and perform channel attention separately. Furthermore, we introduce a knowledge distillation strategy that distills TBSN into smaller denoisers to improve computational efficiency while maintaining performance. Extensive experiments on real-world image denoising datasets show that TBSN largely extends the receptive field and exhibits favorable performance against state-of-the-art SSID methods. The code and pre-trained models will be publicly available at this https URL.
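The masked spatial self-attention can be sketched as below: the attention matrix is masked so each token attends only within a restricted, dilated-convolution-like footprint (the concrete mask pattern is an assumption):

```python
import torch

def masked_attention(q, k, v, mask):
    """Standard scaled dot-product attention with disallowed positions
    set to -inf before the softmax, restricting the receptive field."""
    attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    attn = attn.masked_fill(~mask, float("-inf"))
    return torch.softmax(attn, dim=-1) @ v

n, d = 16, 8  # 16 tokens (a 4x4 map), d-dim heads
mask = torch.eye(n, dtype=torch.bool) | (torch.rand(n, n) > 0.5)  # toy footprint
out = masked_attention(torch.randn(1, n, d), torch.randn(1, n, d),
                       torch.randn(1, n, d), mask)
```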
https://arxiv.org/abs/2404.07846
Object detection, a quintessential task in the realm of perceptual computing, can be tackled using a generative methodology. In the present study, we introduce a novel framework designed to articulate object detection as a denoising diffusion process, which operates on perturbed bounding boxes of annotated entities. This framework, termed ConsistencyDet, leverages an innovative denoising concept known as the Consistency Model. The hallmark of this model is its self-consistency feature, which empowers the model to map distorted information from any temporal stage back to its pristine state, thereby realizing a "one-step denoising" mechanism. Such an attribute markedly elevates the operational efficiency of the model, setting it apart from the conventional Diffusion Model. Throughout the training phase, ConsistencyDet initiates the diffusion sequence with noise-infused boxes derived from the ground-truth annotations and conditions the model to perform the denoising task. Subsequently, in the inference stage, the model employs a denoising sampling strategy that commences with bounding boxes randomly sampled from a normal distribution. Through iterative refinement, the model transforms an assortment of arbitrarily generated boxes into the definitive detections. Comprehensive evaluations employing standard benchmarks, such as MS-COCO and LVIS, corroborate that ConsistencyDet surpasses other leading-edge detectors in performance metrics.
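A skeleton of the inference procedure: boxes drawn from a normal distribution are mapped to clean predictions in one step by the self-consistent function, then optionally re-noised and refined; `f_theta` is a stand-in for the trained model:

```python
import torch

def consistency_detect(f_theta, image_feats, n_boxes=100, refine_steps=2):
    """Start from Gaussian boxes, apply one-step denoising, then re-noise
    at a smaller timestep and repeat for iterative refinement."""
    boxes = torch.randn(n_boxes, 4)  # (cx, cy, w, h), normalized
    t = torch.tensor(1.0)
    for _ in range(refine_steps):
        boxes = f_theta(boxes, t, image_feats)              # one-step denoising
        t = t / 2
        boxes = boxes + t.sqrt() * torch.randn_like(boxes)  # re-noise
    return f_theta(boxes, t, image_feats)

boxes = consistency_detect(lambda b, t, f: b.clamp(-1, 1), image_feats=None)
```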
https://arxiv.org/abs/2404.07773
Building on the remarkable achievements in generative sampling of natural images, we propose an innovative, potentially overly ambitious challenge: generating samples of entire multivariate time series that resemble images. The statistical challenge, however, lies in the small sample size, sometimes consisting of a few hundred subjects. This issue is especially problematic for deep generative models that follow the conventional approach of generating samples from a canonical distribution and then decoding or denoising them to match the true data distribution. In contrast, our method is grounded in information theory and aims to implicitly characterize the distribution of images, particularly the (global and local) dependency structure between pixels. We achieve this by empirically estimating its KL-divergence in the dual form with respect to the respective marginal distribution. This enables us to perform generative sampling directly in the optimized 1-D dual divergence space. Specifically, in the dual space, training samples representing the data distribution are embedded as various clusters between two end points. In theory, any sample embedded between those two end points is in-distribution w.r.t. the data distribution. Our key idea for generating novel image samples is to interpolate between the clusters via a walk along the gradients of the dual function w.r.t. the data dimensions. In addition to the data efficiency gained from direct sampling, we propose an algorithm that significantly reduces the sample complexity of estimating the divergence of the data distribution with respect to the marginal distribution. We provide strong theoretical guarantees along with an extensive empirical evaluation on many real-world datasets from diverse domains, establishing the superiority of our approach over state-of-the-art deep learning methods.
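For concreteness, here is the standard Donsker-Varadhan dual form of the KL divergence with a small critic trained to tighten the bound; the paper's exact estimator, embedding, and sample-complexity reduction are not reproduced here:

```python
import torch
import torch.nn as nn

# Donsker-Varadhan dual form: KL(P || Q) = sup_T  E_P[T(x)] - log E_Q[exp(T(x))].
# The samples below are toy stand-ins for the data and marginal distributions.
T = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
p = torch.randn(512, 2) + 1.0  # samples from the data distribution P
q = torch.randn(512, 2)        # samples from the reference/marginal Q

for _ in range(200):
    opt.zero_grad()
    log_mean_exp = (torch.logsumexp(T(q).squeeze(-1), dim=0)
                    - torch.log(torch.tensor(float(len(q)))))
    dv_bound = T(p).mean() - log_mean_exp
    (-dv_bound).backward()  # gradient ascent on the lower bound
    opt.step()

print(f"estimated KL(P||Q) >= {dv_bound.item():.3f}")
```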
https://arxiv.org/abs/2404.07377
In this paper, we introduce GoodDrag, a novel approach to improve the stability and image quality of drag editing. Unlike existing methods that struggle with accumulated perturbations and often result in distortions, GoodDrag introduces an AlDD framework that alternates between drag and denoising operations within the diffusion process, effectively improving the fidelity of the result. We also propose an information-preserving motion supervision operation that maintains the original features of the starting point for precise manipulation and artifact reduction. In addition, we contribute to the benchmarking of drag editing by introducing a new dataset, Drag100, and developing dedicated quality assessment metrics, Dragging Accuracy Index and Gemini Score, utilizing Large Multimodal Models. Extensive experiments demonstrate that the proposed GoodDrag compares favorably against the state-of-the-art approaches both qualitatively and quantitatively. The project page is this https URL.
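The AlDD alternation can be sketched as an interleaved loop, correcting perturbations as they accumulate rather than all at once; both steps below are stand-ins:

```python
import torch

def aldd_edit(latent, drag_step, denoise_step, n_drags=10, denoise_every=2):
    """Interleave drag (motion-supervision) updates with diffusion denoising
    steps so the latent is pulled back on-manifold as it is perturbed."""
    for i in range(n_drags):
        latent = drag_step(latent)            # move handle toward target
        if (i + 1) % denoise_every == 0:
            latent = denoise_step(latent)     # correct accumulated perturbation
    return denoise_step(latent)

z = aldd_edit(torch.randn(1, 4, 64, 64),
              drag_step=lambda z: z + 0.01 * torch.randn_like(z),
              denoise_step=lambda z: z * 0.99)
```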
https://arxiv.org/abs/2404.07206