Remote Sensing Image Super-Resolution (RSISR) reconstructs high-resolution (HR) remote sensing images from low-resolution inputs to support fine-grained ground object interpretation. Existing methods face three key challenges: (1) difficulty in extracting multi-scale features from spatially heterogeneous RS scenes, (2) limited prior information causing semantic inconsistency in reconstructions, and (3) an imbalanced trade-off between geometric accuracy and visual quality. To address these issues, we propose the Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) with three innovations: First, a Multi-scale Feature Aggregation Block (MFAB) that employs parallel heterogeneous convolutional kernels for multi-scale feature extraction. Second, a Sparse Texture Transfer Guidance (STTG) module that transfers HR texture priors from reference images of similar scenes. Third, a Residual Denoising Dual Diffusion Model (RDDM) framework that combines residual diffusion for deterministic reconstruction with noise diffusion for diverse generation. Experiments on multi-source RS datasets demonstrate TTRD3's superiority over state-of-the-art methods, achieving a 1.43% LPIPS improvement and a 3.67% FID improvement over the best-performing baselines. Code/model: this https URL.
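For illustration, a minimal sketch of what a multi-scale aggregation block built from parallel heterogeneous convolutional kernels can look like; the class name, kernel sizes, and channel split below are assumptions, not the authors' MFAB implementation:

```python
import torch
import torch.nn as nn

class MFABSketch(nn.Module):
    """Hypothetical multi-scale block: parallel heterogeneous kernels
    (1x1, 3x3, 5x5, 7x7) extract features at different receptive fields;
    the branches are concatenated, fused, and added back residually."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels // 4, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5, 7)
        ])
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.fuse(feats)

x = torch.randn(1, 64, 48, 48)
print(MFABSketch()(x).shape)  # torch.Size([1, 64, 48, 48])
```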
https://arxiv.org/abs/2504.13026
Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at this https URL
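The key insight lends itself to a compact sketch: ordinary classifier-free guidance early on (preserving structure), with the condition direction flipped once denoising passes a threshold. The threshold value, time convention, and function name are assumptions:

```python
import torch

def ant_style_guidance(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
                       t: float, t_switch: float = 0.5,
                       scale: float = 7.5) -> torch.Tensor:
    """Classifier-free guidance whose condition direction is reversed in
    mid-to-late denoising. Convention: t in [0, 1], with t = 1 pure noise.
    Early steps (t > t_switch) keep standard CFG so structural integrity is
    preserved; later steps steer away from the unwanted concept."""
    direction = eps_cond - eps_uncond
    sign = 1.0 if t > t_switch else -1.0
    return eps_uncond + sign * scale * direction
```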
https://arxiv.org/abs/2504.12782
How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train diffusion models is the score function of the training distribution; and the networks usually used to learn the score function are expressive enough to learn this score to high accuracy. We claim that a certain feature of the DSM objective -- the fact that its target is not the training distribution's score, but a noisy quantity only equal to it in expectation -- strongly impacts whether and to what extent diffusion models generalize. In this paper, we develop a mathematical theory that partly explains this 'generalization through variance' phenomenon. Our theoretical analysis exploits a physics-inspired path integral approach to compute the distributions typically learned by a few paradigmatic under- and overparameterized diffusion models. We find that the distributions diffusion models effectively learn to sample from resemble their training distributions, but with 'gaps' filled in, and that this inductive bias is due to the covariance structure of the noisy target used during training. We also characterize how this inductive bias interacts with feature-related inductive biases.
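In symbols, the point is that the DSM regression target is a per-sample noisy quantity that only matches the true score in conditional expectation:

$$
\mathcal{L}_{\mathrm{DSM}}(\theta)=\mathbb{E}_{x_0,\;x_t\sim p(x_t\mid x_0)}\big\|s_\theta(x_t,t)-\nabla_{x_t}\log p(x_t\mid x_0)\big\|^2,
\qquad
\mathbb{E}_{x_0\sim p(x_0\mid x_t)}\!\left[\nabla_{x_t}\log p(x_t\mid x_0)\right]=\nabla_{x_t}\log p_t(x_t).
$$

The covariance of this target around its conditional mean is exactly the structure the paper identifies as driving generalization.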
https://arxiv.org/abs/2504.12532
This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50. A total of 290 participants registered for the challenge, with 20 teams successfully submitting valid results, providing insights into the current state-of-the-art in image denoising.
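The challenge setting is easy to reproduce; a minimal sketch of the degradation and the evaluation metric, assuming a 0-255 intensity scale:

```python
import numpy as np

def add_awgn(img: np.ndarray, sigma: float = 50.0) -> np.ndarray:
    """Independent additive white Gaussian noise at the fixed challenge level."""
    return img + np.random.normal(0.0, sigma, img.shape)

def psnr(clean: np.ndarray, restored: np.ndarray, peak: float = 255.0) -> float:
    mse = np.mean((clean - restored) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

clean = np.random.rand(64, 64) * 255.0
noisy = add_awgn(clean)
print(f"PSNR of the noisy input: {psnr(clean, noisy):.2f} dB")  # ~14 dB
```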
https://arxiv.org/abs/2504.12276
In radiation therapy planning, inaccurate segmentations of organs at risk can result in suboptimal treatment delivery if left undetected by the clinician. To address this challenge, we developed a denoising autoencoder-based method to detect inaccurate organ segmentations. We applied noise to ground truth organ segmentations, and the autoencoders were tasked to denoise them. Through the application of our method to organ segmentations generated on both MR and CT scans, we demonstrated that the method is independent of imaging modality. By providing reconstructions, our method offers visual information about inaccurate regions of the organ segmentations, leading to more explainable detection of suboptimal segmentations. We compared our method to existing approaches in the literature and demonstrated that it achieved superior performance for the majority of organs.
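A minimal sketch of the idea, with the corruption model, the tiny stand-in architecture, and the error measure left as assumptions rather than the paper's specifics:

```python
import torch
import torch.nn as nn

def corrupt(mask: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Training-time noise: flip a random fraction of ground-truth mask pixels
    to mimic the segmentation errors the autoencoder must learn to undo."""
    flip = (torch.rand_like(mask) < p).float()
    return mask * (1 - flip) + (1 - mask) * flip

autoencoder = nn.Sequential(          # stand-in; the paper's network differs
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
)

def inaccuracy_map(seg: torch.Tensor) -> torch.Tensor:
    """At test time, the gap between a segmentation and its reconstruction
    localizes potentially inaccurate regions, which is what makes the
    detection explainable."""
    with torch.no_grad():
        recon = autoencoder(seg)
    return (recon - seg).abs()
```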
https://arxiv.org/abs/2504.12203
Multimodal Sentiment Analysis (MSA) faces two critical challenges: the lack of interpretability in the decision logic of multimodal fusion and modality imbalance caused by disparities in inter-modal information density. To address these issues, we propose KAN-MCP, a novel framework that integrates the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework. First, KAN leverages its univariate function decomposition to achieve transparent analysis of cross-modal interactions. This structural design allows direct inspection of feature transformations without relying on external interpretation tools, thereby ensuring both high expressiveness and interpretability. Second, the proposed MCPareto enhances robustness by addressing modality imbalance and noise interference. Specifically, we introduce the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises and reduces feature dimensionality. This approach provides KAN with discriminative low-dimensional inputs to reduce the modeling complexity of KAN while preserving critical sentiment-related information. Furthermore, MCPareto dynamically balances gradient contributions across modalities using the purified features output by DRD-MIB, ensuring lossless transmission of auxiliary signals and effectively alleviating modality imbalance. This synergy of interpretability and robustness not only achieves superior performance on benchmark datasets such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2 but also offers an intuitive visualization interface through KAN's interpretable architecture.
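The transparency claim rests on the Kolmogorov-Arnold representation, in which every multivariate interaction is built from learnable univariate functions that can be inspected one edge at a time:

$$
f(x_1,\dots,x_n)\;=\;\sum_{q=1}^{2n+1}\Phi_q\!\Big(\sum_{p=1}^{n}\phi_{q,p}(x_p)\Big),
$$

so a cross-modal effect can be read off directly from the shape of the relevant $\phi_{q,p}$ rather than inferred with post-hoc attribution tools.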
https://arxiv.org/abs/2504.12151
In recent years, there has been a significant trend toward using large language model (LLM)-based recommender systems (RecSys). Current research primarily focuses on representing complex user-item interactions within a discrete space to align with the inherent discrete nature of language models. However, this approach faces limitations due to its discrete nature: (i) information is often compressed during discretization; (ii) the tokenization and generation of the vast number of users and items in real-world scenarios are constrained by a limited vocabulary. Embracing continuous data presents a promising alternative for enhancing expressive capabilities, though this approach is still in its early stages. To address this gap, we propose a novel framework, DeftRec, which incorporates denoising diffusion models to enable LLM-based RecSys to seamlessly support continuous tokens as input and target. First, we introduce a robust tokenizer with a masking operation and an additive K-way architecture to index users and items, capturing their complex collaborative relationships in continuous tokens. Crucially, we develop a denoising diffusion model that processes user preferences within continuous domains by conditioning on reasoning content from a pre-trained large language model. During the denoising process, we reformulate the objective to include negative interactions, building a comprehensive understanding of user preferences for effective and accurate recommendation generation. Finally, given a continuous token as output, recommendations can easily be generated through score-based retrieval. Extensive experiments demonstrate the effectiveness of the proposed methods, showing that DeftRec surpasses competitive benchmarks, including both traditional and emerging LLM-based RecSys.
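The final retrieval step is simple enough to sketch; the embedding dimension and inner-product scoring below are assumptions:

```python
import torch

def retrieve_topk(generated_token: torch.Tensor,
                  item_tokens: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Score-based retrieval: rank catalogue items by the similarity between
    the denoised continuous output token and each item's continuous index."""
    scores = item_tokens @ generated_token
    return torch.topk(scores, k).indices

item_tokens = torch.randn(10_000, 128)   # assumed continuous item indices
output_token = torch.randn(128)          # result of the denoising process
print(retrieve_topk(output_token, item_tokens))
```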
https://arxiv.org/abs/2504.12007
As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition, as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we first enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resulting embeddings to clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria, without relying on any hyperparameters, to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balance across classes. Comprehensive experiments on the public Drive&Act dataset across all granularity levels demonstrate the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at this https URL.
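As a sketch of how two selection criteria can be intersected without introducing thresholds to tune, one might combine a per-class small-loss test with cluster-majority agreement; the concrete criteria here are illustrative assumptions, not necessarily the paper's (labels are assumed to be non-negative integer class ids):

```python
import numpy as np

def select_clean(losses: np.ndarray, preds: np.ndarray,
                 cluster_ids: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Hyperparameter-free intersection of two filters:
    (1) the sample's loss is below the median loss of its labeled class;
    (2) its prediction agrees with the majority label of its cluster."""
    clean = np.zeros(len(losses), dtype=bool)
    for c in np.unique(labels):
        idx = labels == c
        clean[idx] = losses[idx] < np.median(losses[idx])
    for k in np.unique(cluster_ids):
        idx = cluster_ids == k
        majority = np.bincount(labels[idx]).argmax()
        clean[idx] &= preds[idx] == majority
    return np.flatnonzero(clean)
```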
https://arxiv.org/abs/2504.11966
With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at this https URL.
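Finding (1) translates directly into a weighted objective; the monotone ramp below is an assumed stand-in for whatever schedule the paper fits (convention here: large t means early denoising, i.e., close to pure noise):

```python
import torch

def timestep_weight(t: torch.Tensor, T: int = 1000) -> torch.Tensor:
    """Up-weight early denoising steps, which contribute disproportionately
    to final perception quality; an illustrative linear ramp."""
    return t.float() / T

t = torch.randint(0, 1000, (8,))
per_sample_loss = torch.rand(8)                   # e.g., a segmentation loss
loss = (timestep_weight(t) * per_sample_loss).mean()
```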
https://arxiv.org/abs/2504.11457
Diffusion models have achieved outstanding image generation by reversing a forward noising process to approximate true data distributions. During training, these models predict diffusion scores from noised versions of true samples in a single forward pass, while inference requires iterative denoising starting from white noise. This training-inference divergence hinders the alignment between inference and training data distributions, due to potential prediction biases and cumulative errors. To address this problem, we propose an intuitive but effective fine-tuning framework, called Adversarial Diffusion Tuning (ADT), which simulates the inference process during optimization and aligns the final outputs with training data through adversarial supervision. Specifically, to achieve robust adversarial training, ADT features a siamese-network discriminator with a fixed pre-trained backbone and lightweight trainable parameters, incorporates an image-to-image sampling strategy to smooth discriminative difficulties, and preserves the original diffusion loss to prevent discriminator hacking. In addition, we carefully constrain the backward-flowing path for back-propagating gradients along the inference path without incurring memory overload or gradient explosion. Finally, extensive experiments on Stable Diffusion models (v1.5, XL, and v3) demonstrate that ADT significantly improves both distribution alignment and image quality.
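A sketch of how the two supervision terms could be combined; the weighting, the non-saturating adversarial form, and the argument names are assumptions:

```python
import torch

def adt_style_loss(x_generated: torch.Tensor, diffusion_loss: torch.Tensor,
                   discriminator, lambda_adv: float = 0.1) -> torch.Tensor:
    """Adversarial supervision on the final denoised outputs, added to the
    preserved diffusion loss (which the abstract credits with preventing
    discriminator hacking). `discriminator` stands in for the frozen-backbone
    siamese critic with lightweight trainable parameters."""
    adv = -torch.log(torch.sigmoid(discriminator(x_generated)) + 1e-8).mean()
    return diffusion_loss + lambda_adv * adv
```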
https://arxiv.org/abs/2504.11423
Medical image restoration tasks aim to recover high-quality images from degraded observations, meeting urgent demands in many clinical scenarios, such as low-dose CT image denoising, MRI super-resolution, and MRI artifact removal. Despite the success achieved by existing deep learning-based restoration methods with sophisticated modules, they struggle to produce reconstructions in a computationally efficient manner. Moreover, they usually ignore the reliability of the restoration results, which is far more urgent in medical systems. To alleviate these issues, we present LRformer, a Lightweight Transformer-based method via Reliability-guided learning in the frequency domain. Specifically, inspired by the uncertainty quantification in Bayesian neural networks (BNNs), we develop a Reliable Lesion-Semantic Prior Producer (RLPP). RLPP leverages Monte Carlo (MC) estimators with stochastic sampling operations to generate sufficiently reliable priors by performing multiple inferences on the foundational medical image segmentation model, MedSAM. Additionally, instead of directly incorporating the priors in the spatial domain, we decompose the cross-attention (CA) mechanism into real symmetric and imaginary anti-symmetric parts via the fast Fourier transform (FFT), resulting in the design of the Guided Frequency Cross-Attention (GFCA) solver. By leveraging the conjugate symmetry of the FFT, GFCA reduces the computational complexity of naive CA by nearly half. Extensive experimental results on various tasks demonstrate the superiority of the proposed LRformer in both effectiveness and efficiency.
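The RLPP idea reduces to repeated stochastic inference; a minimal sketch, with `model` standing in for MedSAM and dropout assumed to be the stochastic sampling operation:

```python
import torch

@torch.no_grad()
def mc_prior(model, image: torch.Tensor, n_samples: int = 8):
    """Monte Carlo estimate of a lesion-semantic prior: several stochastic
    forward passes give a mean mask (the prior) and a per-pixel variance
    (a reliability map in the spirit of BNN uncertainty quantification)."""
    model.train()                        # keep stochastic layers (dropout) on
    outs = torch.stack([model(image) for _ in range(n_samples)])
    return outs.mean(dim=0), outs.var(dim=0)
```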
https://arxiv.org/abs/2504.11286
Recent advancements in human image animation have been propelled by video diffusion models, yet their reliance on numerous iterative denoising steps results in high inference costs and slow speeds. An intuitive solution is to adopt consistency models, which serve as an effective acceleration paradigm through consistency distillation. However, simply employing this strategy in human image animation often leads to quality decline, including visual blurring, motion degradation, and facial distortion, particularly in dynamic regions. In this paper, we propose the DanceLCM approach, complemented by several enhancements that improve visual quality and motion continuity in the low-step regime: (1) segmented consistency distillation with an auxiliary lightweight head to incorporate supervision from real video latents, mitigating the cumulative errors that result from single full-trajectory generation; (2) a motion-focused loss that centres on motion regions, and explicit injection of facial fidelity features to improve face authenticity. Extensive qualitative and quantitative experiments demonstrate that DanceLCM achieves results comparable to state-of-the-art video diffusion models with a mere 2-4 inference steps, significantly reducing the inference burden without compromising video quality. The code and models will be made publicly available.
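The motion-focused loss admits a simple reading: weight the reconstruction error by a motion mask so dynamic regions dominate. The frame-difference mask and weighting form below are assumptions:

```python
import torch

def motion_focused_loss(pred: torch.Tensor, target: torch.Tensor,
                        prev_frame: torch.Tensor) -> torch.Tensor:
    """Weight errors by inter-frame motion so the regions where consistency
    distillation degrades most (dynamic areas) drive the objective."""
    motion = (target - prev_frame).abs().mean(dim=1, keepdim=True)
    weight = 1.0 + motion / (motion.mean() + 1e-6)
    return (weight * (pred - target) ** 2).mean()
```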
https://arxiv.org/abs/2504.11143
Because human and environmental factors interfere with image capture, polyp images usually suffer from issues such as dim lighting, blur, and overexposure, which pose challenges for downstream polyp segmentation tasks. To address noise-induced degradation in polyp images, we present AgentPolyp, a novel framework integrating CLIP-based semantic guidance and dynamic image enhancement with a lightweight neural network for segmentation. The agent first evaluates image quality using CLIP-driven semantic analysis (e.g., identifying "low-contrast polyps with vascular textures") and adapts reinforcement learning strategies to dynamically apply multi-modal enhancement operations (e.g., denoising, contrast adjustment). A quality assessment feedback loop optimizes pixel-level enhancement and segmentation focus in a collaborative manner, ensuring robust preprocessing before neural network segmentation. This modular architecture supports plug-and-play extensions for various enhancement algorithms and segmentation networks, meeting the deployment requirements of endoscopic devices.
https://arxiv.org/abs/2504.10978
Implicit neural representation (INR) has emerged as a powerful paradigm for visual data representation. However, classical INR methods represent data in the original space mixed with different frequency components, and several feature encoding parameters (e.g., the frequency parameter $\omega$ or the rank $R$) need manual configurations. In this work, we propose a self-evolving cross-frequency INR using the Haar wavelet transform (termed CF-INR), which decouples data into four frequency components and employs INRs in the wavelet space. CF-INR allows the characterization of different frequency components separately, thus enabling higher accuracy for data representation. To more precisely characterize cross-frequency components, we propose a cross-frequency tensor decomposition paradigm for CF-INR with self-evolving parameters, which automatically updates the rank parameter $R$ and the frequency parameter $\omega$ for each frequency component through self-evolving optimization. This self-evolution paradigm eliminates the laborious manual tuning of these parameters, and learns a customized cross-frequency feature encoding configuration for each dataset. We evaluate CF-INR on a variety of visual data representation and recovery tasks, including image regression, inpainting, denoising, and cloud removal. Extensive experiments demonstrate that CF-INR outperforms state-of-the-art methods in each case.
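The decoupling step itself is standard; one level of the 2-D Haar transform produces exactly four frequency components, each of which would receive its own INR (a sketch using PyWavelets):

```python
import numpy as np
import pywt  # PyWavelets

img = np.random.rand(256, 256)
# LL = approximation; LH, HL, HH = horizontal/vertical/diagonal details.
LL, (LH, HL, HH) = pywt.dwt2(img, 'haar')
recon = pywt.idwt2((LL, (LH, HL, HH)), 'haar')
print(np.allclose(img, recon))  # True: the transform is exactly invertible
```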
https://arxiv.org/abs/2504.10929
Recent video generation research has focused heavily on isolated actions, leaving interactive motions, such as hand-face interactions, largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.
https://arxiv.org/abs/2504.10905
Diffusion models have recently achieved state-of-the-art performance on many image generation tasks. However, most models require significant computational resources to do so. This becomes apparent in medical image synthesis due to the 3D nature of medical datasets such as CT scans, MRIs, and electron microscopy. In this paper we propose a novel architecture for single-GPU, memory-efficient training of diffusion models on high-dimensional medical datasets. The proposed model is built using an invertible UNet architecture with invertible attention modules. This leads to the following two contributions: 1. the memory usage of the denoising diffusion model becomes independent of the dimensionality of the dataset, and 2. energy usage during training is reduced. While this new model can be applied to a multitude of image generation tasks, we showcase its memory efficiency on the 3D BraTS2020 dataset, achieving up to a 15% decrease in peak memory consumption during training with results comparable to SOTA while maintaining image quality.
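The memory argument hinges on invertibility: if a block's input can be recomputed exactly from its output, activations need not be cached for backpropagation. A generic additive-coupling sketch (not the paper's invertible UNet or attention module):

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Invertible block: inputs are recoverable from outputs, so activation
    storage can be avoided during training."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Conv3d(channels // 2, channels // 2, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)
        return torch.cat([x1, x2 + self.f(x1)], dim=1)

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        y1, y2 = y.chunk(2, dim=1)
        return torch.cat([y1, y2 - self.f(y1)], dim=1)

block = AdditiveCoupling(8)
x = torch.randn(1, 8, 16, 16, 16)                  # a small 3D volume
print(torch.allclose(block.inverse(block(x)), x))  # True
```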
https://arxiv.org/abs/2504.10883
Music editing is an important step in music production, with broad applications including game development and film production. Most existing zero-shot text-guided methods rely on pretrained diffusion models, involving forward-backward diffusion processes for editing. However, these methods often struggle to maintain music content consistency. Additionally, text instructions alone usually fail to accurately describe the desired music. In this paper, we propose two music editing methods that enhance consistency between the original and edited music by leveraging score distillation. The first method, SteerMusic, is a coarse-grained zero-shot editing approach using the delta denoising score. The second method, SteerMusic+, enables fine-grained personalized music editing by manipulating a concept token that represents a user-defined musical style. SteerMusic+ allows music to be edited into user-defined musical styles that cannot be achieved by text instructions alone. Experimental results show that our methods outperform existing approaches in preserving both music content consistency and editing fidelity. User studies further validate that our methods achieve superior music editing quality. Audio examples are available on this https URL.
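A sketch of a delta-denoising-score-style gradient, where subtracting the source branch cancels the noise bias of plain score distillation; the call signature and noising convention are assumptions:

```python
import torch

def add_noise(x, noise, alpha_bar):
    """Standard DDPM-style forward noising with cumulative alpha in (0, 1)."""
    return alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * noise

def dds_gradient(eps_model, x_src, x_edit, alpha_bar, t, c_src, c_tgt):
    """The paired prediction difference keeps only the direction induced by
    changing the condition from c_src to c_tgt; used as the gradient of an
    update on x_edit."""
    noise = torch.randn_like(x_edit)
    return (eps_model(add_noise(x_edit, noise, alpha_bar), t, c_tgt)
            - eps_model(add_noise(x_src, noise, alpha_bar), t, c_src))
```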
https://arxiv.org/abs/2504.10826
Sign languages are dynamic visual languages that involve hand gestures in combination with non-manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make it challenging to study them in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations, which serve as an additional educational resource to complement video content. This process is usually done by an artist, and is therefore quite costly. We propose a method that illustrates sign language videos by leveraging generative models' ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch-like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand's direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch-like style to sign languages, especially for hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as keys and values into high-resolution attention layers, and fusing geometric information from the image and its edges as queries. For the final illustration, we use the attention mechanism to combine the attention weights from both the start and end illustrations, resulting in a soft combination. Our method offers a cost-effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.
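The attention intervention can be sketched generically: queries carry the geometric information (image plus edges), keys and values are injected from the style reference, and the start/end attention weights are softly mixed. Tensor shapes and the mixing weight are assumptions:

```python
import torch

def style_attention(q_geom: torch.Tensor, k_style: torch.Tensor,
                    v_style: torch.Tensor):
    """Attention with style injected as keys/values and geometry as queries."""
    d = q_geom.shape[-1]
    attn = torch.softmax(q_geom @ k_style.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn, attn @ v_style

q_start, q_end = torch.randn(2, 1, 256, 64)   # queries from start/end frames
k, v = torch.randn(2, 1, 77, 64)              # injected style keys and values
attn_start, _ = style_attention(q_start, k, v)
attn_end, _ = style_attention(q_end, k, v)
w = 0.5                                        # assumed soft-combination weight
combined = (w * attn_start + (1 - w) * attn_end) @ v
```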
https://arxiv.org/abs/2504.10822
Remote sensing images are widely utilized in many disciplines, such as feature recognition and scene semantic segmentation. However, due to environmental factors and issues with the imaging system, image quality is often degraded, which may impair subsequent visual tasks. Even though denoising remote sensing images plays an essential role before applications, current denoising algorithms fail to attain optimum performance because these images possess complex textural features. Denoising frameworks based on artificial neural networks have shown better performance; however, they require exhaustive training with heterogeneous samples that extensively consumes resources such as power, memory, and computation, and adds latency. Thus, here we present a computationally efficient and robust remote sensing image denoising method that does not require additional training samples. This method partitions a remote sensing image into patches whose patch space is underlain by a low-rank manifold representing the noise-free version of the image. An efficient and robust approach to revealing this manifold is a randomized approximation of the singular value spectrum of the geodesics' Gramian matrix of the patch space. The method places a distinct emphasis on each color channel during denoising, and the three denoised channels are merged to produce the final image.
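The core numerical step can be sketched with an off-the-shelf randomized SVD; the Gramian here is a random stand-in, and the number of retained components is an assumption:

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 2000))
G = A @ A.T                 # symmetric PSD stand-in for the geodesics' Gramian

# Randomized approximation of the leading singular value spectrum, avoiding
# a full eigendecomposition of the patch-space Gramian.
U, S, _ = randomized_svd(G, n_components=50, random_state=0)
low_rank_basis = U          # spans the noise-free patch manifold's directions
```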
https://arxiv.org/abs/2504.10820
Controllable scene generation could substantially reduce the cost of diverse data collection for autonomous driving. Prior works formulate traffic layout generation as a predictive process, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full-sequence denoising hinders online reaction, while the latter's short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios because open datasets consist largely of safe, ordinary driving behaviors. To overcome these limitations, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinary and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset of complex corner cases covering 540 hours of simulated data, including high-risk interactions such as cut-ins, sudden braking, and collisions. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.
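The decoupled pipeline's key mechanism is per-token noise states with partial masking; a sketch under assumed tensor shapes (tokens are (N, D), `alphas_bar` is the cumulative noise schedule of length T):

```python
import torch

def partial_noise_masking(tokens, alphas_bar, t_per_token, frozen_mask):
    """Each fine-grained token carries its own timestep; tokens flagged by
    `frozen_mask` (e.g., freshly observed environment state) stay clean,
    which is what lets the scene update during the denoising process."""
    noise = torch.randn_like(tokens)
    a = alphas_bar[t_per_token].unsqueeze(-1)            # per-token schedule
    noisy = a.sqrt() * tokens + (1 - a).sqrt() * noise
    keep = frozen_mask.unsqueeze(-1).float()
    return keep * tokens + (1 - keep) * noisy
```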
https://arxiv.org/abs/2504.10485