Diffusion-based generative models have emerged as powerful tools in the realm of generative modeling. Despite extensive research on denoising across various timesteps and noise levels, a conflict persists regarding the relative difficulties of the denoising tasks. While various studies argue that lower timesteps present more challenging tasks, others contend that higher timesteps are more difficult. To address this conflict, our study undertakes a comprehensive examination of task difficulties, focusing on convergence behavior and changes in relative entropy between consecutive probability distributions across timesteps. Our observational study reveals that denoising at earlier timesteps poses challenges characterized by slower convergence and higher relative entropy, indicating increased task difficulty at these lower timesteps. Building on these observations, we introduce an easy-to-hard learning scheme, drawing from curriculum learning, to enhance the training process of diffusion models. By organizing timesteps or noise levels into clusters and training models with descending orders of difficulty, we facilitate an order-aware training regime, progressing from easier to harder denoising tasks, thereby deviating from the conventional approach of training diffusion models simultaneously across all timesteps. Our approach leads to improved performance and faster convergence by leveraging the benefits of curriculum learning, while maintaining orthogonality with existing improvements in diffusion training techniques. We validate these advantages through comprehensive experiments in image generation tasks, including unconditional, class-conditional, and text-to-image generation.
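To make the easy-to-hard scheme concrete, here is a minimal Python sketch (my illustration, not the paper's configuration) of a curriculum timestep sampler that starts from the higher, easier timesteps and progressively unlocks the lower, harder clusters identified by the abstract; the cluster count, boundaries, and unlocking schedule are hypothetical choices.

```python
import numpy as np

def curriculum_timestep_sampler(step, total_steps, T=1000, n_clusters=4, rng=None):
    """Sample a training timestep under an easy-to-hard curriculum.

    Following the abstract's observation that lower timesteps are harder,
    only the highest-timestep (easiest) cluster is active at the start of
    training, and lower (harder) clusters are unlocked as training proceeds.
    """
    rng = rng or np.random.default_rng()
    boundaries = np.linspace(0, T, n_clusters + 1, dtype=int)
    progress = step / total_steps
    n_active = 1 + int(round(progress * (n_clusters - 1)))
    lo = boundaries[n_clusters - n_active]   # lowest timestep currently allowed
    return int(rng.integers(lo, T))

# Early in training only t in [750, 1000) is drawn; late in training any t in [0, 1000).
print(curriculum_timestep_sampler(step=0, total_steps=100_000))
print(curriculum_timestep_sampler(step=100_000, total_steps=100_000))
```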
https://arxiv.org/abs/2403.10348
Super-resolution (SR) and image generation are important tasks in computer vision and are widely adopted in real-world applications. Most existing methods, however, generate images only at fixed-scale magnification and suffer from over-smoothing and artifacts. Additionally, they do not offer enough diversity of output images nor image consistency at different scales. The most relevant prior work applied Implicit Neural Representation (INR) to the denoising diffusion model to obtain continuous-resolution yet diverse and high-quality SR results. Since that model operates in the image space, the larger the resolution of the produced image, the more memory and inference time are required, and it also does not maintain scale-specific consistency. We propose a novel pipeline that can super-resolve an input image or generate a novel image from random noise at arbitrary scales. The method consists of a pretrained auto-encoder, a latent diffusion model, an implicit neural decoder, and their learning strategies. The proposed method adopts diffusion processes in a latent space, and is thus efficient, yet aligned with the output image space decoded by MLPs at arbitrary scales. More specifically, our arbitrary-scale decoder is designed by combining, in series, the symmetric decoder (without up-scaling) from the pretrained auto-encoder and a Local Implicit Image Function (LIIF). The latent diffusion process is learnt with the denoising and alignment losses jointly. Errors in output images are backpropagated via the fixed decoder, improving the quality of output images. In extensive experiments using multiple public benchmarks on the two tasks, i.e., image super-resolution and novel image generation at arbitrary scales, the proposed method outperforms relevant methods in metrics of image quality, diversity, and scale consistency. It is also significantly better than the relevant prior art in inference speed and memory usage.
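A rough sketch of the arbitrary-scale decoding idea described above: a LIIF-style implicit decoder samples a latent feature map at continuous coordinates and maps each query to an RGB value, so the same latent (e.g. one produced by a latent diffusion model) can be rendered at any resolution. The toy architecture, dimensions, and sampling scheme below are assumptions for illustration, not the paper's decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitDecoder(nn.Module):
    """Toy LIIF-style implicit decoder: an MLP maps a sampled latent feature
    plus a continuous (x, y) query coordinate to an RGB value."""

    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, feat, coords):
        # feat: (B, C, h, w) latent feature map; coords: (B, N, 2) in [-1, 1].
        sampled = F.grid_sample(feat, coords.unsqueeze(1), align_corners=False)
        sampled = sampled.squeeze(2).permute(0, 2, 1)          # -> (B, N, C)
        return self.mlp(torch.cat([sampled, coords], dim=-1))  # -> (B, N, 3)

def make_coords(h, w):
    ys, xs = torch.linspace(-1, 1, h), torch.linspace(-1, 1, w)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy], dim=-1).reshape(1, -1, 2)

decoder = ImplicitDecoder()
latent = torch.randn(1, 64, 16, 16)            # e.g. a sample from a latent diffusion model
rgb = decoder(latent, make_coords(128, 128))   # decode the same latent at 128x128
print(rgb.shape)                               # torch.Size([1, 16384, 3])
```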
https://arxiv.org/abs/2403.10255
The pretraining-finetuning paradigm has gained widespread adoption in vision tasks and other fields, yet it faces the significant challenge of high sample annotation costs. To mitigate this, the concept of active finetuning has emerged, aiming to select the most appropriate samples for model finetuning within a limited budget. Traditional active learning methods often struggle in this setting due to their inherent bias in batch selection. Furthermore, the recent active finetuning approach has primarily concentrated on aligning the distribution of selected subsets with the overall data pool, focusing solely on diversity. In this paper, we propose a Bi-Level Active Finetuning framework to select the samples for annotation in one shot, which includes two stages: core sample selection for diversity, and boundary sample selection for uncertainty. The process begins with the identification of pseudo-class centers, followed by an innovative denoising method and an iterative strategy for boundary sample selection in the high-dimensional feature space, all without relying on ground-truth labels. Our comprehensive experiments provide both qualitative and quantitative evidence of our method's efficacy, outperforming all the existing baselines.
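The sketch below illustrates the bi-level idea schematically: pseudo-class centers from k-means provide diverse core samples, while small-margin samples between the two nearest centers provide uncertain boundary samples. It is a simplified reading of the abstract; the paper's denoising method and iterative strategy are not reproduced, and all hyper-parameters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def bilevel_select(features, n_centers=10, n_core=50, n_boundary=50, seed=0):
    """Stage 1: core samples nearest to pseudo-class centers (diversity).
    Stage 2: samples with the smallest margin between their two nearest
    centers (uncertainty). All thresholds are illustrative."""
    km = KMeans(n_clusters=n_centers, n_init=10, random_state=seed).fit(features)
    dists = np.linalg.norm(features[:, None, :] - km.cluster_centers_[None, :, :], axis=-1)

    core = np.argsort(dists.min(axis=1))[:n_core]             # near a center -> diverse core set
    two_nearest = np.sort(dists, axis=1)[:, :2]
    margin = two_nearest[:, 1] - two_nearest[:, 0]             # small margin -> ambiguous sample
    core_set = set(core.tolist())
    boundary = [i for i in np.argsort(margin) if i not in core_set][:n_boundary]
    return np.concatenate([core, np.asarray(boundary)])

feats = np.random.randn(2000, 128).astype(np.float32)          # e.g. pretrained features
print(bilevel_select(feats).shape)                             # (100,)
```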
https://arxiv.org/abs/2403.10069
Hyperspectral image (HSI) denoising is critical for the effective analysis and interpretation of hyperspectral data. However, simultaneously modeling global and local features to enhance HSI denoising has rarely been explored. In this letter, we propose a hybrid convolution and attention network (HCANet), which leverages the strengths of both convolutional neural networks (CNNs) and Transformers. To enhance the modeling of both global and local features, we devise a convolution and attention fusion module aimed at capturing long-range dependencies and neighborhood spectral correlations. Furthermore, to improve multi-scale information aggregation, we design a multi-scale feed-forward network that enhances denoising performance by extracting features at different scales. Experimental results on mainstream HSI datasets demonstrate the rationality and effectiveness of the proposed HCANet. The proposed model is effective in removing various types of complex noise. Our codes are available at \url{this https URL}.
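As a toy illustration of fusing convolution and attention (not HCANet's actual block), the sketch below combines a depthwise convolution branch for neighborhood correlations with a self-attention branch for long-range dependencies and adds them residually.

```python
import torch
import torch.nn as nn

class ConvAttentionFusion(nn.Module):
    """Depthwise conv branch (neighborhood spectral correlations) fused with
    a self-attention branch (long-range dependencies) by residual addition."""

    def __init__(self, channels=32, heads=4):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                              # x: (B, C, H, W) HSI features
        b, c, h, w = x.shape
        local = self.local(x)
        tokens = self.norm(x.flatten(2).transpose(1, 2))        # (B, H*W, C)
        global_, _ = self.attn(tokens, tokens, tokens)
        global_ = global_.transpose(1, 2).reshape(b, c, h, w)
        return x + local + global_

block = ConvAttentionFusion()
print(block(torch.randn(2, 32, 16, 16)).shape)         # torch.Size([2, 32, 16, 16])
```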
https://arxiv.org/abs/2403.10067
Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce a novel framework, SphereDiffusion, to address these unique challenges and better generate high-quality, precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these specific techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation and relatively reduces FID by around 35% on average.
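One simple way to exploit spherical rotation invariance for data diversity, loosely in the spirit of the abstract, is to rotate equirectangular panoramas about the vertical axis, which amounts to a circular shift of pixel columns; the sketch below is a generic augmentation, not SphereDiffusion's training objective.

```python
import numpy as np

def random_yaw_rotation(pano, rng=None):
    """Rotating an equirectangular panorama about the vertical axis is a
    horizontal circular shift of its pixel columns."""
    rng = rng or np.random.default_rng()
    h, w, _ = pano.shape
    shift = int(rng.integers(0, w))              # yaw angle expressed in pixel columns
    return np.roll(pano, shift, axis=1)

pano = np.zeros((256, 512, 3), dtype=np.uint8)   # H x 2H equirectangular panorama
print(random_yaw_rotation(pano).shape)           # (256, 512, 3)
```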
https://arxiv.org/abs/2403.10044
We present a novel image editing scenario termed Text-grounded Object Generation (TOG), defined as generating a new object in a real image spatially conditioned by textual descriptions. Existing diffusion models exhibit limited spatial perception in complex real-world scenes and rely on additional modalities to enforce constraints, and TOG imposes heightened challenges on scene comprehension under the weak supervision of linguistic information. We propose a universal framework, ST-LDM, based on Swin-Transformer, which can be integrated into any latent diffusion model with training-free backward guidance. ST-LDM encompasses a global-perceptual autoencoder with adaptable compression scales and hierarchical visual features, in parallel with a deformable multimodal transformer that generates region-wise guidance for the subsequent denoising process. We transcend the limitation of traditional attention mechanisms, which only focus on existing visual features, by introducing deformable feature alignment to hierarchically refine spatial positioning fused with multi-scale visual and linguistic information. Extensive experiments demonstrate that our model enhances the localization of attention mechanisms while preserving the generative capabilities inherent to diffusion models.
https://arxiv.org/abs/2403.10004
Understanding human actions from body poses is critical for assistive robots sharing space with humans in order to make informed and safe decisions about the next interaction. However, precise temporal localization and annotation of activity sequences is time-consuming and the resulting labels are often noisy. If not effectively addressed, label noise negatively affects the model's training, resulting in lower recognition quality. Despite its importance, addressing label noise for skeleton-based action recognition has been overlooked so far. In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark. Observations reveal that these baselines yield only marginal performance when dealing with sparse skeleton data. Consequently, we introduce a novel methodology, NoiseEraSAR, which integrates global sample selection, co-teaching, and Cross-Modal Mixture-of-Experts (CM-MOE) strategies, aimed at mitigating the adverse impacts of label noise. Our proposed approach demonstrates better performance on the established benchmark, setting new state-of-the-art standards. The source code for this study will be made accessible at this https URL.
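Co-teaching, one of the ingredients the abstract names, trains two networks that exchange their small-loss (likely clean) samples. The sketch below shows one generic co-teaching update with a toy skeleton classifier; the keep ratio, models, and feature layout are hypothetical, and the global-selection and CM-MOE parts are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def co_teaching_step(model_a, model_b, opt_a, opt_b, x, y, keep_ratio, loss_fn):
    """One generic co-teaching update: each network keeps the samples on which
    the *other* network has the smallest loss, since small-loss samples are
    more likely to carry clean labels."""
    keep = max(1, int(keep_ratio * x.size(0)))
    with torch.no_grad():
        loss_a = loss_fn(model_a(x), y, reduction="none")
        loss_b = loss_fn(model_b(x), y, reduction="none")
    idx_for_b = torch.argsort(loss_a)[:keep]       # A picks clean-looking samples for B
    idx_for_a = torch.argsort(loss_b)[:keep]       # B picks clean-looking samples for A

    opt_a.zero_grad()
    loss_fn(model_a(x[idx_for_a]), y[idx_for_a]).backward()
    opt_a.step()
    opt_b.zero_grad()
    loss_fn(model_b(x[idx_for_b]), y[idx_for_b]).backward()
    opt_b.step()

# Toy usage: linear classifiers over flattened skeleton features (hypothetical layout).
net_a, net_b = nn.Linear(25 * 3, 60), nn.Linear(25 * 3, 60)
opt_a = torch.optim.SGD(net_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(net_b.parameters(), lr=0.1)
x, y = torch.randn(64, 25 * 3), torch.randint(0, 60, (64,))
co_teaching_step(net_a, net_b, opt_a, opt_b, x, y, keep_ratio=0.7, loss_fn=F.cross_entropy)
```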
https://arxiv.org/abs/2403.09975
We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for solving inverse problems for 3D human pose and shape reconstruction. These inverse problems involve fitting a human body model to image observations, traditionally solved through optimization techniques. ScoreHMR mimics model fitting approaches, but alignment with the image observation is achieved through score guidance in the latent space of a diffusion model. The diffusion model is trained to capture the conditional distribution of the human model parameters given an input image. By guiding its denoising process with a task-specific score, ScoreHMR effectively solves inverse problems for various applications without the need for retraining the task-agnostic diffusion model. We evaluate our approach on three settings/applications. These are: (i) single-frame model fitting; (ii) reconstruction from multiple uncalibrated views; (iii) reconstructing humans in video sequences. ScoreHMR consistently outperforms all optimization baselines on popular benchmarks across all settings. We make our code and models available at this https URL.
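Score guidance of the kind ScoreHMR builds on generally nudges the predicted noise with the gradient of a task-specific objective evaluated on the current clean estimate. The sketch below shows this generic recipe only; the clean-sample estimate, scaling, and stand-in functions are simplifying assumptions rather than the paper's exact update.

```python
import torch

def guided_denoise_step(eps_model, x_t, t, sigma_t, task_loss, scale=1.0):
    """Nudge the predicted noise with the gradient of a task-specific loss
    evaluated on a crude clean-sample estimate."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    x0_hat = x_t - sigma_t * eps                 # simplified clean estimate
    grad = torch.autograd.grad(task_loss(x0_hat), x_t)[0]
    return (eps + scale * sigma_t * grad).detach()

eps_model = lambda x, t: 0.1 * x                 # stand-in for a trained noise predictor
x_t = torch.randn(1, 8)                          # e.g. latent human-model parameters
keypoint_error = lambda x0: (x0 ** 2).sum()      # stand-in for a reprojection loss
print(guided_denoise_step(eps_model, x_t, t=10, sigma_t=0.5, task_loss=keypoint_error))
```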
https://arxiv.org/abs/2403.09623
Understanding the interactions of atoms such as forces in 3D atomistic systems is fundamental to many applications like molecular dynamics and catalyst design. However, simulating these interactions requires compute-intensive ab initio calculations and thus results in limited data for training neural networks. In this paper, we propose to use denoising non-equilibrium structures (DeNS) as an auxiliary task to better leverage training data and improve performance. For training with DeNS, we first corrupt a 3D structure by adding noise to its 3D coordinates and then predict the noise. Different from previous works on denoising, which are limited to equilibrium structures, the proposed method generalizes denoising to a much larger set of non-equilibrium structures. The main difference is that a non-equilibrium structure does not correspond to local energy minima and has non-zero forces, and therefore it can have many possible atomic positions compared to an equilibrium structure. This makes denoising non-equilibrium structures an ill-posed problem since the target of denoising is not uniquely defined. Our key insight is to additionally encode the forces of the original non-equilibrium structure to specify which non-equilibrium structure we are denoising. Concretely, given a corrupted non-equilibrium structure and the forces of the original one, we predict the non-equilibrium structure satisfying the input forces instead of any arbitrary structures. Since DeNS requires encoding forces, DeNS favors equivariant networks, which can easily incorporate forces and other higher-order tensors in node embeddings. We study the effectiveness of training equivariant networks with DeNS on OC20, OC22 and MD17 datasets and demonstrate that DeNS can achieve new state-of-the-art results on OC20 and OC22 and significantly improve training efficiency on MD17.
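The auxiliary objective can be sketched directly from the abstract: corrupt the 3D coordinates with Gaussian noise, condition the network on the original structure's forces so the denoising target is well defined, and regress the noise. The toy MLP below exists only to make the loss runnable; the paper favors equivariant networks, and the noise scale is an arbitrary choice.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Non-equivariant stand-in; the paper uses equivariant networks that
    encode forces (and other higher-order tensors) in node embeddings."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(6, hidden), nn.SiLU(), nn.Linear(hidden, 3))

    def forward(self, noisy_pos, forces):
        return self.net(torch.cat([noisy_pos, forces], dim=-1))

def dens_auxiliary_loss(denoiser, pos, forces, noise_std=0.05):
    """Corrupt 3D coordinates, condition on the original structure's forces to
    disambiguate the target, and regress the added noise."""
    noise = noise_std * torch.randn_like(pos)     # (n_atoms, 3)
    pred = denoiser(pos + noise, forces)
    return ((pred - noise) ** 2).mean()

pos, forces = torch.randn(20, 3), torch.randn(20, 3)   # 20 atoms with non-zero forces
print(dens_auxiliary_loss(ToyDenoiser(), pos, forces))
```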
https://arxiv.org/abs/2403.09549
Trajectory prediction is an essential component in autonomous driving, particularly for collision avoidance systems. Considering the inherent uncertainty of the task, numerous studies have utilized generative models to produce multiple plausible future trajectories for each agent. However, most of them suffer from restricted representation ability or unstable training issues. To overcome these limitations, we propose utilizing the diffusion model to generate the distribution of future trajectories. Two cruxes must be settled to realize this idea. First, the diversity of intention is intertwined with the uncertain surroundings, making the true distribution hard to parameterize. Second, the diffusion process is time-consuming during the inference phase, rendering it unrealistic to implement in a real-time driving system. We propose an Intention-aware denoising Diffusion Model (IDM), which tackles the above two problems. We decouple the original uncertainty into intention uncertainty and action uncertainty and model them with two dependent diffusion processes. To decrease the inference time, we reduce the variable dimensions in the intention-aware diffusion process and restrict the initial distribution of the action-aware diffusion process, which leads to fewer diffusion steps. To validate our approach, we conduct experiments on the Stanford Drone Dataset (SDD) and ETH/UCY dataset. Our method achieves state-of-the-art results, with an FDE of 13.83 pixels on the SDD dataset and 0.36 meters on the ETH/UCY dataset. Compared with the original diffusion model, IDM reduces inference time by two-thirds. Interestingly, our experiments further reveal that introducing intention information is beneficial in modeling the diffusion process with fewer steps.
https://arxiv.org/abs/2403.09190
Diffusion models have achieved remarkable success across a range of generative tasks. Recent efforts to enhance diffusion model architectures have reimagined them as a form of multi-task learning, where each task corresponds to a denoising task at a specific noise level. While these efforts have focused on parameter isolation and task routing, they fall short of capturing detailed inter-task relationships and risk losing semantic information, respectively. In response, we introduce Switch Diffusion Transformer (Switch-DiT), which establishes inter-task relationships between conflicting tasks without compromising semantic information. To achieve this, we employ a sparse mixture-of-experts within each transformer block to utilize semantic information and facilitate handling conflicts in tasks through parameter isolation. Additionally, we propose a diffusion prior loss, encouraging similar tasks to share their denoising paths while isolating conflicting ones. Through these designs, each transformer block contains a shared expert across all tasks, where the common and task-specific denoising paths enable the diffusion model to construct a beneficial way of synergizing denoising tasks. Extensive experiments validate the effectiveness of our approach in improving both image quality and convergence rate, and further analysis demonstrates that Switch-DiT constructs tailored denoising paths across various generation scenarios.
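A minimal sketch of the layer family involved (not Switch-DiT's exact design): a sparse mixture-of-experts in which one shared expert processes every input while a router assigns each sample, based on its timestep embedding, to one of several task-specific experts.

```python
import torch
import torch.nn as nn

class SharedExpertMoE(nn.Module):
    """One always-on shared expert plus top-1 routed experts; routing is driven
    by the timestep (denoising-task) embedding."""

    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.shared = nn.Linear(dim, dim)                         # common denoising path
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x, t_embed):
        gate = self.router(t_embed).softmax(dim=-1)               # (B, n_experts)
        top = gate.argmax(dim=-1)                                 # top-1 expert per sample
        routed = torch.stack([self.experts[int(i)](xb) for xb, i in zip(x, top)])
        return self.shared(x) + routed                            # shared + task-specific path

layer = SharedExpertMoE()
x, t_embed = torch.randn(8, 64), torch.randn(8, 64)
print(layer(x, t_embed).shape)                                    # torch.Size([8, 64])
```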
https://arxiv.org/abs/2403.09176
Anatomical trees play a central role in clinical diagnosis and treatment planning. However, accurately representing anatomical trees is challenging due to their varying and complex topology and geometry. Traditional methods for representing tree structures, captured using medical imaging, while invaluable for visualizing vascular and bronchial networks, exhibit drawbacks in terms of limited resolution, flexibility, and efficiency. Recently, implicit neural representations (INRs) have emerged as a powerful tool for representing shapes accurately and efficiently. We propose a novel approach for representing anatomical trees using INR, while also capturing the distribution of a set of trees via denoising diffusion in the space of INRs. We accurately capture the intricate geometries and topologies of anatomical trees at any desired resolution. Through extensive qualitative and quantitative evaluation, we demonstrate high-fidelity tree reconstruction with arbitrary resolution yet compact storage, and versatility across anatomical sites and tree complexities.
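The INR idea itself is compact: a coordinate MLP represents the shape as a continuous field, so geometry can be queried at any resolution. The sketch below fits a toy signed-distance INR to an analytic cylinder standing in for a vessel segment; it illustrates the representation only, not the paper's model or the diffusion over INRs.

```python
import torch
import torch.nn as nn

class TreeINR(nn.Module):
    """Coordinate MLP mapping a 3D point to a signed distance."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):                        # (N, 3) -> (N, 1)
        return self.net(xyz)

model = TreeINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                               # fit to a cylinder of radius 0.2 (toy "vessel")
    xyz = torch.rand(1024, 3) * 2 - 1
    sdf_gt = torch.linalg.norm(xyz[:, :2], dim=1, keepdim=True) - 0.2
    loss = ((model(xyz) - sdf_gt) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# The fitted field can now be queried at arbitrary resolution, e.g. on the axis:
print(model(torch.tensor([[0.0, 0.0, 0.5]])))      # true SDF there is -0.2
```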
https://arxiv.org/abs/2403.08974
Self-supervised speech representation learning enables the extraction of meaningful features from raw waveforms. These features can then be efficiently used across multiple downstream tasks. However, two significant issues arise when considering the deployment of such methods ``in-the-wild": (i) their large size, which can be prohibitive for edge applications; and (ii) their robustness to detrimental factors, such as noise and/or reverberation, that can heavily degrade the performance of such systems. In this work, we propose RobustDistiller, a novel knowledge distillation mechanism that tackles both problems jointly. Alongside the distillation recipe, we apply a multi-task learning objective to encourage the network to learn noise-invariant representations by denoising the input. The proposed mechanism is evaluated on twelve different downstream tasks. It outperforms several benchmarks regardless of the noise type or the noise and reverberation levels. Experimental results show that the new Student model with 23M parameters can achieve results comparable to the Teacher model with 95M parameters. Lastly, we show that the proposed recipe can be applied to other distillation methodologies, such as the recent DPWavLM. For reproducibility, code and model checkpoints will be made available at \mbox{\url{this https URL}}.
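The joint objective described above can be sketched as distillation on noisy inputs toward the teacher's clean-input representation, plus an auxiliary denoising (reconstruction) loss. The linear models, dimensions, and weighting below are placeholders to keep the example runnable, not RobustDistiller's architecture or recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def robust_distill_loss(student, teacher, recon_head, clean_wave, noisy_wave, alpha=0.5):
    """The student sees the noisy waveform but is pulled toward the teacher's
    representation of the clean one; an auxiliary head reconstructs the clean
    signal, encouraging noise-invariant representations."""
    with torch.no_grad():
        target_repr = teacher(clean_wave)                 # teacher never sees the noise
    student_repr = student(noisy_wave)
    distill = F.mse_loss(student_repr, target_repr)
    denoise = F.mse_loss(recon_head(student_repr), clean_wave)
    return distill + alpha * denoise

dim = 16000                                               # 1 s of 16 kHz audio, flattened
student, teacher, recon = nn.Linear(dim, 256), nn.Linear(dim, 256), nn.Linear(256, dim)
clean = torch.randn(2, dim)
noisy = clean + 0.1 * torch.randn_like(clean)
print(robust_distill_loss(student, teacher, recon, clean, noisy))
```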
https://arxiv.org/abs/2403.08654
We present ActionDiffusion -- a novel diffusion model for procedure planning in instructional videos that is the first to take temporal inter-dependencies between actions into account in a diffusion model for procedure planning. This approach is in stark contrast to existing methods that fail to exploit the rich information content available in the particular order in which actions are performed. Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process by projecting the action information into the noise space. This is achieved 1) by adding action embeddings in the noise masks in the noise-adding phase and 2) by introducing an attention mechanism in the noise prediction network to learn the correlations between different action steps. We report extensive experiments on three instructional video benchmark datasets (CrossTask, Coin, and NIV) and show that our method outperforms previous state-of-the-art methods on all metrics on CrossTask and NIV and on all metrics except accuracy on the Coin dataset. We show that by adding action embeddings into the noise mask the diffusion model can better learn action temporal dependencies and improve performance on procedure planning.
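Item 1) of the method can be sketched as follows: during the forward (noise-adding) phase, the Gaussian noise applied to the action-plan tensor is offset by the action embeddings, so action information lives in the noise space the model learns to predict. The schedule and mixing rule below are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def add_action_conditioned_noise(plan, action_embed, t, alphas_cumprod):
    """Forward (noise-adding) phase: the Gaussian noise applied to the action
    plan is offset by the action embeddings, so action information is carried
    in the noise the model later learns to predict."""
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(plan) + action_embed          # action-aware noise mask
    noisy_plan = a_bar.sqrt() * plan + (1 - a_bar).sqrt() * noise
    return noisy_plan, noise                                # `noise` is the training target

T = 100
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)
plan = torch.randn(1, 3, 32)                                # 3 action steps, 32-dim each
action_embed = torch.randn(1, 3, 32)
noisy, target = add_action_conditioned_noise(plan, action_embed, 50, alphas_cumprod)
print(noisy.shape, target.shape)
```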
https://arxiv.org/abs/2403.08591
Radiation therapy is crucial in cancer treatment. Experienced experts typically iteratively generate high-quality dose distribution maps, forming the basis for excellent radiation therapy plans. Therefore, automated prediction of dose distribution maps is significant in expediting the treatment process and providing a better starting point for developing radiation therapy plans. With the remarkable results of diffusion models in predicting high-frequency regions of dose distribution maps, dose prediction methods based on diffusion models have been extensively studied. However, existing methods mainly utilize CNNs or Transformers as denoising networks. CNNs lack global receptive fields, resulting in suboptimal prediction performance. Transformers excel in global modeling but face quadratic complexity with image size, resulting in significant computational overhead. To tackle these challenges, we introduce a novel diffusion model, MD-Dose, based on the Mamba architecture for predicting radiation therapy dose distribution in thoracic cancer patients. In the forward process, MD-Dose adds Gaussian noise to dose distribution maps to obtain pure noise images. In the backward process, MD-Dose utilizes a noise predictor based on Mamba to predict the noise, ultimately outputting the dose distribution maps. Furthermore, we develop a Mamba encoder to extract structural information and integrate it into the noise predictor for localizing dose regions in the planning target volume (PTV) and organs at risk (OARs). Through extensive experiments on a dataset of 300 thoracic tumor patients, we showcase the superiority of MD-Dose in various metrics and time consumption.
Radiation therapy is crucial in cancer treatment. Experienced experts typically generate high-quality dose distribution maps iteratively, forming the basis of excellent radiation therapy plans; automated prediction of dose distribution maps is therefore important for speeding up the treatment process and providing a better starting point for plan development. Given the remarkable results of diffusion models in predicting the high-frequency regions of dose distribution maps, diffusion-based dose prediction methods have been studied extensively. However, existing methods mainly use CNNs or Transformers as the denoising network: CNNs lack the ability to capture a global receptive field, leading to suboptimal prediction performance, while Transformers excel at global modeling but incur significant computational overhead as image size grows. To address these challenges, we introduce MD-Dose, a novel diffusion model based on the Mamba architecture for predicting radiation therapy dose distributions in thoracic cancer patients. In the forward process, MD-Dose adds Gaussian noise to dose distribution maps to obtain pure noise images; in the backward process, MD-Dose uses a Mamba-based noise predictor to predict the noise and ultimately output the dose distribution maps. We further develop a Mamba encoder that extracts structural information and integrates it into the noise predictor to localize dose regions in the planning target volume (PTV) and organs at risk (OARs). Through extensive experiments on a dataset of 300 thoracic tumor patients, we demonstrate the superiority of MD-Dose across various metrics and in time consumption.
https://arxiv.org/abs/2403.08479
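The following sketch shows one plausible way to wire the conditioning described above: a noise predictor receives the noisy dose map, the timestep, and structural features from an anatomy encoder. Since the actual MD-Dose networks are Mamba-based, the tiny CNNs here are only placeholders used to make the data flow explicit.

```python
import torch
import torch.nn as nn

# Placeholder anatomy encoder and noise predictor; the real model uses Mamba blocks.
class AnatomyEncoder(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, ct):                          # CT / structure masks: (B, 1, H, W)
        return self.net(ct)

class NoisePredictor(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, ch), nn.SiLU(), nn.Linear(ch, ch))
        self.net = nn.Sequential(nn.Conv2d(1 + ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, noisy_dose, t, struct_feat):
        t_emb = self.time_mlp(t.float().view(-1, 1))[:, :, None, None]
        h = torch.cat([noisy_dose, struct_feat + t_emb], dim=1)
        return self.net(h)                          # predicted noise, same shape as the dose map

encoder, predictor = AnatomyEncoder(), NoisePredictor()
dose, ct = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(dose)
noisy_dose = 0.9 * dose + 0.4 * noise               # stand-in for the closed-form forward step
loss = nn.functional.mse_loss(predictor(noisy_dose, t, encoder(ct)), noise)
```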
Image interpolation based on diffusion models is promising for creating fresh and interesting images. Advanced interpolation methods mainly focus on spherical linear interpolation, where images are encoded into the noise space and then interpolated and denoised back to images. However, existing methods face challenges in effectively interpolating natural images (not generated by diffusion models), thereby restricting their practical applicability. Our experimental investigations reveal that these challenges stem from the invalidity of the encoded noise, which may no longer obey the expected noise distribution, e.g., a normal distribution. To address these challenges, we propose NoiseDiffusion, a novel approach to correct the noise used for image interpolation. Specifically, NoiseDiffusion pushes the invalid noise toward the expected distribution by introducing subtle Gaussian noise, and imposes a constraint to suppress noise with extreme values. In this context, promoting noise validity contributes to mitigating image artifacts, but the constraint and the introduced exogenous noise typically lead to a reduction in signal-to-noise ratio, i.e., loss of original image information. Hence, NoiseDiffusion performs interpolation within the noisy image space and injects raw images into these noisy counterparts to address the challenge of information loss. Consequently, NoiseDiffusion enables us to interpolate natural images without causing artifacts or information loss, thus achieving the best interpolation results.
Image interpolation based on diffusion models is a promising way to create fresh and interesting images. Advanced interpolation methods mainly focus on spherical linear interpolation, in which images are encoded into the noise space and then interpolated and denoised back to images. However, existing methods struggle to interpolate natural images (those not generated by diffusion models), which limits their practical applicability. Our experimental investigation reveals that these challenges stem from the invalidity of the encoded noise, which may no longer follow the expected noise distribution, e.g., a normal distribution. To address these challenges, we propose NoiseDiffusion, a method that corrects the noise used for image interpolation. Specifically, NoiseDiffusion pushes the invalid noise toward the expected distribution by introducing subtle Gaussian noise and imposes a constraint that suppresses noise with extreme values. Promoting noise validity does help mitigate image artifacts, but the constraint and the injected exogenous noise typically reduce the signal-to-noise ratio, i.e., original image information is lost. NoiseDiffusion therefore performs interpolation within the noisy image space and injects the raw images into these noisy counterparts to compensate for the information loss. As a result, NoiseDiffusion can interpolate natural images without introducing artifacts or losing information, achieving the best interpolation results.
https://arxiv.org/abs/2403.08840
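A minimal sketch of the two ideas attributed to NoiseDiffusion above: the encoded noise is nudged toward a standard normal by mixing in a little fresh Gaussian noise and clipping extreme values, and interpolation is done in the noisy image space with a small injection of the raw images to compensate for lost information. The mixing, clipping, and injection coefficients below are illustrative assumptions, not the paper's values.

```python
import torch

def correct_noise(z, mix=0.1, clip=3.0):
    z = (1 - mix) * z + mix * torch.randn_like(z)    # pull toward N(0, I)
    return z.clamp(-clip, clip)                      # suppress extreme values

def interpolate(z0, z1, x0, x1, lam=0.5, inject=0.05):
    z0, z1 = correct_noise(z0), correct_noise(z1)
    z = (1 - lam) * z0 + lam * z1                    # interpolate in the noisy space
    return z + inject * ((1 - lam) * x0 + lam * x1)  # re-inject raw image information

x0, x1 = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)   # stand-ins for two natural images
z0, z1 = x0 + torch.randn_like(x0), x1 + torch.randn_like(x1)   # their (possibly invalid) noised encodings
z_mid = interpolate(z0, z1, x0, x1, lam=0.5)
```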
Image noise and motion artifacts greatly affect the quality of brain MRI and negatively influence downstream medical image analysis. Previous studies often focus on 2D methods that process each volumetric MR image slice-by-slice, thus losing important 3D anatomical information. Additionally, these studies generally treat image denoising and artifact correction as two standalone tasks, without considering their potential relationship, especially on low-quality images where severe noise and motion artifacts occur simultaneously. To address these issues, we propose a Joint image Denoising and motion Artifact Correction (JDAC) framework via iterative learning to handle noisy MRIs with motion artifacts, consisting of an adaptive denoising model and an anti-artifact model. In the adaptive denoising model, we first design a novel noise level estimation strategy, and then adaptively reduce the noise through a U-Net backbone with feature normalization conditioned on the estimated noise variance. The anti-artifact model employs another U-Net for eliminating motion artifacts, incorporating a novel gradient-based loss function designed to maintain the integrity of brain anatomy during the motion correction process. The two models are applied alternately for joint image denoising and artifact correction within an iterative learning framework. An early stopping strategy based on the noise level estimate is applied to accelerate the iteration process. The denoising model is trained with 9,544 T1-weighted MRIs with manually added Gaussian noise as supervision. The anti-artifact model is trained on 552 T1-weighted MRIs with motion artifacts and paired motion-free images. Experimental results on a public dataset and a clinical study suggest the effectiveness of JDAC in both tasks of denoising and motion artifact correction, compared with several state-of-the-art methods.
Image noise and motion artifacts greatly affect the quality of brain MRI and negatively impact downstream medical image analysis. Previous studies usually rely on 2D methods that process volumetric MR images slice by slice, losing important 3D anatomical information. Moreover, they generally treat image denoising and artifact correction as two standalone tasks without considering their potential relationship, especially for low-quality images in which severe noise and motion artifacts occur simultaneously. To address these issues, we propose a Joint image Denoising and motion Artifact Correction (JDAC) framework that handles noisy MRIs with motion artifacts via iterative learning; it consists of an adaptive denoising model and an anti-artifact model. In the adaptive denoising model, we first design a novel noise level estimation strategy and then adaptively reduce the noise with a U-Net backbone whose feature normalization is conditioned on the estimated noise variance. The anti-artifact model uses another U-Net to remove motion artifacts, incorporating a novel gradient-based loss function designed to preserve the integrity of brain anatomy during motion correction. The two models are applied iteratively for joint denoising and artifact correction, and an early stopping strategy based on the noise level estimate accelerates the iteration. The denoising model is trained on 9,544 T1-weighted MRIs with manually added Gaussian noise as supervision, and the anti-artifact model on 552 T1-weighted MRIs with motion artifacts and paired motion-free images. Experimental results on a public dataset and a clinical study show that JDAC is effective in both denoising and motion artifact correction compared with several state-of-the-art methods.
https://arxiv.org/abs/2403.08162
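The iterative scheme described above can be summarized in a few lines of PyTorch: estimate the noise level, denoise conditioned on it, correct motion artifacts, and stop early once the estimate falls below a threshold. The crude noise estimator and the tiny 3D networks below are placeholders, not the paper's learned estimator or U-Nets.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Placeholder for a conditional denoising network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv3d(2, 8, 3, padding=1), nn.ReLU(),
                                 nn.Conv3d(8, 1, 3, padding=1))

    def forward(self, x, cond):
        return self.net(torch.cat([x, cond.expand_as(x)], dim=1))

def estimate_noise_std(x):
    # crude surrogate for a noise-level estimator
    return (x - x.mean()).std().view(1, 1, 1, 1, 1)

denoiser = TinyNet()
deartifact = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
                           nn.Conv3d(8, 1, 3, padding=1))   # placeholder anti-artifact net

mri = torch.randn(1, 1, 16, 32, 32)               # noisy, motion-corrupted volume
for step in range(5):
    sigma = estimate_noise_std(mri)
    if sigma.item() < 0.05:                       # early stopping on low estimated noise
        break
    mri = denoiser(mri, sigma)                    # denoising conditioned on the noise level
    mri = deartifact(mri)                         # motion-artifact correction
```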
In this work, we investigate the potential of a large language model (LLM) to directly comprehend visual signals without the necessity of fine-tuning on multi-modal datasets. The foundational concept of our method views an image as a linguistic entity, and translates it to a set of discrete words derived from the LLM's vocabulary. To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this innovative image encoding, the LLM gains the ability not only for visual comprehension but also for image denoising and restoration in an auto-regressive fashion, crucially without any fine-tuning. We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration. Code and models are available at this https URL.
In this work, we investigate the potential of a large language model (LLM) to directly understand visual signals without fine-tuning on multi-modal datasets. The core idea of our method is to view an image as a linguistic entity and translate it into a set of discrete words drawn from the LLM's vocabulary. To this end, we present the Vision-to-Language Tokenizer (V2T Tokenizer), which converts an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this image encoding, the LLM gains the ability not only to understand images but also to perform image denoising and restoration in an auto-regressive fashion, crucially without any fine-tuning. We conduct rigorous experiments to validate the method, covering understanding tasks such as image recognition, image captioning, and visual question answering, as well as image denoising tasks such as inpainting, outpainting, deblurring, and shift restoration. Code and models are available at this https URL.
https://arxiv.org/abs/2403.07874
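One way to picture the tokenization step described above is as a nearest-neighbour lookup of image patch features against a frozen LLM embedding table, so each patch is assigned an ordinary vocabulary token. The sketch below assumes an L2 nearest-neighbour rule and a toy patch encoder; the actual V2T Tokenizer additionally involves an encoder-decoder and a CLIP model.

```python
import torch
import torch.nn as nn

vocab_size, dim = 32000, 64
llm_embeddings = nn.Embedding(vocab_size, dim)              # stands in for a frozen LLM vocabulary
encoder = nn.Sequential(nn.Conv2d(3, dim, 16, stride=16))   # toy 16x16 patch encoder

def image_to_tokens(img):
    feat = encoder(img)                          # (B, dim, H/16, W/16)
    feat = feat.flatten(2).transpose(1, 2)       # (B, N, dim): one vector per patch
    flat = feat.reshape(-1, dim)                 # (B*N, dim)
    dists = torch.cdist(flat, llm_embeddings.weight)   # L2 distance to every vocabulary embedding
    return dists.argmin(dim=-1).view(feat.shape[:2])   # (B, N) token ids in the LLM's vocabulary

img = torch.randn(2, 3, 224, 224)
token_ids = image_to_tokens(img)                 # "foreign language" description of the image
```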
Pre-trained models with large-scale training data, such as CLIP and Stable Diffusion, have demonstrated remarkable performance in various high-level computer vision tasks such as image understanding and generation from language descriptions. Yet, their potential for low-level tasks such as image restoration remains relatively unexplored. In this paper, we explore such models to enhance image restoration. As off-the-shelf features (OSF) from pre-trained models do not directly serve image restoration, we propose to learn an additional lightweight module called Pre-Train-Guided Refinement Module (PTG-RM) to refine restoration results of a target restoration network with OSF. PTG-RM consists of two components, Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE), and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations, while PTG-CSA enhances spatial-channel attention for restoration-related learning. Extensive experiments demonstrate that PTG-RM, with its compact size (<1M parameters), effectively enhances restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising.
Pre-trained models with large-scale training data, such as CLIP and Stable Diffusion, have shown remarkable performance on various high-level computer vision tasks such as image understanding and generation from language descriptions, yet their potential for low-level tasks such as image restoration remains relatively unexplored. In this paper, we explore such models to enhance image restoration. Since off-the-shelf features (OSF) from pre-trained models do not directly serve image restoration, we propose to learn an additional lightweight module, the Pre-Train-Guided Refinement Module (PTG-RM), to refine the restoration results of a target restoration network using OSF. PTG-RM consists of two components: Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE) and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations, while PTG-CSA enhances spatial-channel attention for restoration-related learning. Extensive experiments show that PTG-RM, despite its compact size (<1M parameters), effectively improves the restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising.
https://arxiv.org/abs/2403.06793
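The refinement pattern described above, stripped to its essentials, is a small module that takes a restoration network's output together with off-the-shelf features from a frozen pre-trained backbone and predicts a residual correction. The generic fusion module below is an assumption for illustration and does not reproduce the PTG-SVE or PTG-CSA designs.

```python
import torch
import torch.nn as nn

class RefinementModule(nn.Module):
    """Generic residual refinement conditioned on pre-trained (off-the-shelf) features."""
    def __init__(self, feat_dim=512, ch=16):
        super().__init__()
        self.reduce = nn.Conv2d(feat_dim, ch, 1)
        self.fuse = nn.Sequential(nn.Conv2d(3 + ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, restored, osf):
        osf = nn.functional.interpolate(self.reduce(osf), size=restored.shape[-2:])
        return restored + self.fuse(torch.cat([restored, osf], dim=1))  # residual refinement

restored = torch.randn(1, 3, 128, 128)    # output of some target restoration network
osf = torch.randn(1, 512, 8, 8)           # off-the-shelf features from a frozen backbone
refined = RefinementModule()(restored, osf)
```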
Point clouds are extensively employed in a variety of real-world applications such as robotics, autonomous driving and augmented reality. Despite the recent success of point cloud neural networks, especially for safety-critical tasks, it is essential to also ensure the robustness of the model. A typical way to assess a model's robustness is through adversarial attacks, where test-time examples are generated based on gradients to deceive the model. While many different defense mechanisms have been studied in 2D, research on defenses for 3D point clouds remains relatively limited. Inspired by PointDP, which denoises the network inputs by diffusion, we propose Point Cloud Layerwise Diffusion (PCLD), a layerwise diffusion based 3D point cloud defense strategy. Unlike PointDP, we propagate the diffusion denoising after each layer to incrementally enhance the results. We apply our defense method to different types of commonly used point cloud models and adversarial attacks to evaluate its robustness. Our experiments demonstrate that the proposed defense method achieves results that are comparable to or surpass those of existing methods, establishing robustness through a novel technique. Code is available at this https URL.
Point clouds are widely used in real-world applications such as robotics, autonomous driving, and augmented reality. Despite the recent success of point cloud neural networks, especially for safety-critical tasks, it is essential to also ensure the robustness of the model. A typical way to assess a model's robustness is through adversarial attacks, in which test-time examples are generated from gradients to deceive the model. While many defense mechanisms have been studied in 2D, research on defenses for 3D point clouds remains relatively limited. Inspired by PointDP, which denoises the network inputs via diffusion, we propose Point Cloud Layerwise Diffusion (PCLD), a layerwise diffusion-based defense strategy for 3D point clouds. Unlike PointDP, we propagate diffusion denoising after each layer to incrementally enhance the results. We apply our defense to various commonly used point cloud models and adversarial attacks to evaluate its robustness. Our experiments show that the proposed defense achieves results comparable to or surpassing those of existing methods, establishing robustness through a novel technique. Code is available at this https URL.
https://arxiv.org/abs/2403.06698
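To illustrate the layerwise idea above, the sketch below applies a diffusion-style noise-and-denoise step to the intermediate features after every layer of a point cloud network, rather than only to the raw input as in PointDP. The per-layer denoisers here are untrained placeholders standing in for learned diffusion models, and the network itself is a toy per-point MLP.

```python
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(3, 64), nn.Linear(64, 64), nn.Linear(64, 64)])
denoisers = nn.ModuleList([nn.Linear(64, 64) for _ in layers])   # per-layer denoising nets

def purify(feat, denoiser, sigma=0.1):
    noisy = feat + sigma * torch.randn_like(feat)   # partial forward diffusion on the features
    return denoiser(noisy)                          # one-step reverse (denoising) estimate

points = torch.randn(8, 1024, 3)                    # possibly adversarial point cloud
h = points
for layer, denoiser in zip(layers, denoisers):
    h = torch.relu(layer(h))
    h = purify(h, denoiser)                         # layerwise diffusion-based defense
```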