Generating safe and reliable trajectories for autonomous vehicles in long-tail scenarios remains a significant challenge, particularly for high-lateral-acceleration maneuvers such as sharp turns, which represent critical safety situations. Existing trajectory planners exhibit systematic failures in these scenarios due to data imbalance, which leads to insufficient modelling of vehicle dynamics, road geometry, and environmental constraints in high-risk situations and to suboptimal or unsafe trajectory prediction when vehicles operate near their physical limits. In this paper, we introduce ReflexDiffusion, a novel inference-stage framework that enhances diffusion-based trajectory planners through reflective adjustment. Our method adds a gradient-based adjustment mechanism to the iterative denoising process: after each standard trajectory update, we use the difference between the conditional and unconditional noise predictions as a gradient signal to explicitly amplify critical conditioning signals, including road curvature and lateral vehicle dynamics. This amplification enforces strict adherence to physical constraints, particularly improving stability during high-lateral-acceleration maneuvers where precise vehicle-road interaction is paramount. Evaluated on the nuPlan Test14-hard benchmark, ReflexDiffusion achieves a 14.1% improvement in driving score for high-lateral-acceleration scenarios over state-of-the-art (SOTA) methods. This demonstrates that inference-time trajectory optimization can effectively compensate for training data sparsity by dynamically reinforcing safety-critical constraints near handling limits. The framework's architecture-agnostic design enables direct deployment to existing diffusion-based planners, offering a practical solution for improving autonomous vehicle safety in challenging driving conditions.
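The abstract gives no pseudocode, so the following is a minimal sketch of how such a reflective adjustment could sit inside a standard denoising loop: after the usual classifier-free-guided update, the sample is pushed further along the conditional-minus-unconditional direction. The toy noise schedule, trajectory shape, and the step size `gamma` are assumptions, not the paper's values.

```python
import torch

def ddim_step(x, eps, t, alpha_bar):
    """One deterministic DDIM update from step t toward t-1 (eta = 0)."""
    a_t = alpha_bar[t]
    a_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
    x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean sample
    return a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps

@torch.no_grad()
def reflective_denoise(model, cond, null_cond, steps=50, w=3.0, gamma=0.1):
    alpha_bar = torch.linspace(0.999, 0.01, steps)       # toy noise schedule
    x = torch.randn(1, 64, 2)                            # 64 (x, y) waypoints
    for t in reversed(range(steps)):
        eps_c = model(x, t, cond)                        # conditional prediction
        eps_u = model(x, t, null_cond)                   # unconditional prediction
        eps = eps_u + w * (eps_c - eps_u)                # classifier-free guidance
        x = ddim_step(x, eps, t, alpha_bar)              # standard trajectory update
        # Reflective adjustment: push the sample further along the conditioning
        # direction so road-geometry / lateral-dynamics signals are amplified.
        x = x - gamma * (eps_c - eps_u)
    return x

# Toy stand-in model so the sketch runs end to end.
model = lambda x, t, c: 0.1 * torch.randn_like(x) + 0.01 * c
traj = reflective_denoise(model, cond=torch.ones(1, 64, 2), null_cond=torch.zeros(1, 64, 2))
```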
https://arxiv.org/abs/2601.09377
Substation meters play a critical role in monitoring and ensuring the stable operation of power grids, yet detection of cracks and other physical defects on them is often hampered by a severe scarcity of annotated samples. To address this few-shot generation challenge, we propose a novel framework that integrates Knowledge Embedding and Hypernetwork-Guided Conditional Control into a Stable Diffusion pipeline, enabling realistic and controllable synthesis of defect images from limited data. First, we bridge the substantial domain gap between natural-image pre-trained models and industrial equipment by fine-tuning a Stable Diffusion backbone using DreamBooth-style knowledge embedding. This process encodes the unique structural and textural priors of substation meters, ensuring generated images retain authentic meter characteristics. Second, we introduce a geometric crack modeling module that parameterizes defect attributes (such as location, length, curvature, and branching pattern) to produce spatially constrained control maps. These maps provide precise, pixel-level guidance during generation. Third, we design a lightweight hypernetwork that dynamically modulates the denoising process of the diffusion model in response to the control maps and high-level defect descriptors, achieving a flexible balance between generation fidelity and controllability. Extensive experiments on a real-world substation meter dataset demonstrate that our method substantially outperforms existing augmentation and generation baselines. It reduces Fréchet Inception Distance (FID) by 32.7%, increases diversity metrics, and, most importantly, boosts the mAP of a downstream defect detector trained on the augmented data by 15.3%. The framework offers a practical, high-quality data synthesis solution for industrial inspection systems where defect samples are rare.
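As an illustration of the geometric crack modeling module, here is a hedged sketch that renders a control map from location, length, and curvature parameters; branching and the actual parameter ranges are omitted assumptions, not the paper's specification.

```python
import numpy as np

def crack_control_map(size=256, start=(64.0, 80.0), length=120, curvature=0.004, angle=0.3):
    """Render a toy pixel-level control map from geometric crack parameters
    (location, length, curvature); all values here are illustrative."""
    m = np.zeros((size, size), dtype=np.float32)
    x, y = start
    for s in range(length):
        a = angle + curvature * s                 # curvature gradually bends the path
        x, y = x + np.cos(a), y + np.sin(a)
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < size and 0 <= yi < size:
            m[yi, xi] = 1.0                       # one-pixel-wide crack trace
    return m

control = crack_control_map()                     # fed to the control branch as guidance
```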
https://arxiv.org/abs/2601.09238
We propose a novel piecewise smooth image model with piecewise constant local parameters that are automatically adapted to each image. Technically, the model is formulated in terms of factor graphs with NUP (normal with unknown parameters) priors, and the pertinent computations amount to iterations of conjugate-gradient steps and Gaussian message passing. The proposed model and algorithms are demonstrated with applications to denoising and contrast enhancement.
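A toy 1D example conveys the flavor of the computations the abstract names (conjugate-gradient solves alternating with local parameter updates); the IRLS-style weight update below is a simple stand-in for the paper's NUP message passing, not its exact algorithm.

```python
import numpy as np
from scipy.sparse import identity, diags
from scipy.sparse.linalg import cg

def piecewise_smooth_denoise(y, lam=20.0, iters=10, eps=1e-3):
    """Toy 1D denoiser: minimize |x - y|^2 + lam * sum_i w_i (x_{i+1} - x_i)^2,
    re-estimating the local weights w_i from the current gradients
    (an IRLS-style stand-in for the NUP parameter updates)."""
    n = len(y)
    D = diags([-np.ones(n - 1), np.ones(n - 1)], [0, 1], shape=(n - 1, n))
    x, w = y.copy(), np.ones(n - 1)
    for _ in range(iters):
        A = identity(n) + lam * (D.T @ diags(w) @ D)   # quadratic subproblem
        x, _ = cg(A, y, x0=x)                          # conjugate-gradient solve
        w = 1.0 / (np.abs(D @ x) + eps)                # small weights across edges
    return x

# Piecewise-constant signal plus noise.
y = np.concatenate([np.zeros(100), np.ones(100)]) + 0.1 * np.random.randn(200)
x_hat = piecewise_smooth_denoise(y)
```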
https://arxiv.org/abs/2601.08749
Aims: This study investigates whether a U-Net architecture can perform standalone end-to-end blind deconvolution of astronomical images without any prior knowledge of the Point Spread Function (PSF) or noise characteristics. Our goals are to evaluate its performance as a function of the number of training images, to compare it against classical Tikhonov deconvolution, and to assess its generalization capability under varying seeing conditions and noise levels. Methods: Realistic astronomical observations are simulated using the GalSim toolkit, incorporating random transformations, PSF convolution (accounting for both optical and atmospheric effects), and Gaussian white noise. A U-Net model is trained using a Mean Square Error (MSE) loss function on datasets of varying sizes, up to 40,000 images of size 48x48 from the COSMOS Real Galaxy Dataset. Performance is evaluated using PSNR, SSIM, and cosine similarity metrics, with the latter employed in a two-model framework to assess solution stability. Results: The U-Net model demonstrates effectiveness in blind deconvolution, with performance improving consistently as the training dataset size increases and saturating beyond 5,000 images. Cosine similarity analysis reveals convergence between independently trained models, indicating stable solutions. Remarkably, the U-Net outperforms the oracle-like Tikhonov method in challenging conditions (low PSNR/medium SSIM). The model also generalizes well to unseen seeing and noise conditions, although optimal performance is achieved when training parameters include validation conditions. Experiments on synthetic $C^\alpha$ images further support the hypothesis that the U-Net learns a geometry-adaptive harmonic basis, akin to sparse representations observed in denoising tasks. These results align with recent mathematical insights into its adaptive learning capabilities.
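A hedged numpy sketch of the simulation pipeline (with a Gaussian PSF standing in for the GalSim optical-plus-atmospheric model) shows how one (degraded, clean) training pair would be produced; the U-Net is then trained with MSE loss on such pairs.

```python
import numpy as np

def make_training_pair(clean, fwhm=3.0, noise_sigma=0.02, rng=np.random.default_rng()):
    """Simulate one (degraded, clean) pair: PSF convolution + Gaussian white noise.
    A numpy stand-in for the GalSim pipeline described in the paper."""
    n = clean.shape[0]
    yy, xx = np.mgrid[:n, :n] - n // 2
    sigma = fwhm / 2.355                                  # FWHM -> Gaussian sigma
    psf = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    psf /= psf.sum()
    # FFT convolution; ifftshift moves the centered kernel to the origin.
    blurred = np.real(np.fft.ifft2(np.fft.fft2(clean) * np.fft.fft2(np.fft.ifftshift(psf))))
    return blurred + rng.normal(0, noise_sigma, clean.shape), clean

degraded, target = make_training_pair(np.random.rand(48, 48))   # 48x48 as in the paper
```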
https://arxiv.org/abs/2601.08666
We introduce a two-stage multitask learning framework for analyzing Electroencephalography (EEG) signals that integrates denoising, dynamical modeling, and representation learning. In the first stage, a denoising autoencoder is trained to suppress artifacts and stabilize temporal dynamics, providing robust signal representations. In the second stage, a multitask architecture processes these denoised signals to achieve three objectives: motor imagery classification, chaotic versus non-chaotic regime discrimination using Lyapunov exponent-based labels, and self-supervised contrastive representation learning with NT-Xent loss. A convolutional backbone combined with a Transformer encoder captures spatial-temporal structure, while the dynamical task encourages sensitivity to nonlinear brain dynamics. This staged design mitigates interference between reconstruction and discriminative goals, improves stability across datasets, and supports reproducible training by clearly separating noise reduction from higher-level feature learning. Empirical studies show that our framework not only enhances robustness and generalization but also surpasses strong baselines and recent state-of-the-art methods in EEG decoding, highlighting the effectiveness of combining denoising, dynamical features, and self-supervised learning.
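The NT-Xent loss used for the contrastive objective is standard (as in SimCLR); a minimal PyTorch sketch follows, with the EEG encoder that produces the two views assumed rather than shown.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent contrastive loss for a batch of positive pairs (z1[i], z2[i])."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)       # 2N x d, unit norm
    sim = z @ z.T / tau                               # scaled cosine similarities
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))                 # exclude self-pairs
    # Row i's positive is row i+n, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(32, 128), torch.randn(32, 128))
```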
https://arxiv.org/abs/2601.08549
A lack of standardized datasets has long hindered progress in automatic intrapulse modulation classification (AIMC) - a critical task in radar signal analysis for electronic support systems, particularly under noisy or degraded conditions. AIMC seeks to identify the modulation type embedded within a single radar pulse from its complex in-phase and quadrature (I/Q) representation, enabling automated interpretation of intrapulse structure. This paper introduces AIMC-Spec, a comprehensive synthetic dataset for spectrogram-based image classification, encompassing 33 modulation types across 13 signal-to-noise ratio (SNR) levels. To benchmark AIMC-Spec, five representative deep learning algorithms - ranging from lightweight CNNs and denoising architectures to transformer-based networks - were re-implemented and evaluated under a unified input format. The results reveal significant performance variation, with frequency-modulated (FM) signals classified more reliably than phase or hybrid types, particularly at low SNRs. A focused FM-only test further highlights how modulation type and network architecture influence classifier robustness. AIMC-Spec establishes a reproducible baseline and provides a foundation for future research and standardization in the AIMC domain.
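To illustrate the kind of sample the dataset contains, here is a sketch that synthesizes one noisy LFM (chirp) pulse as complex I/Q at a chosen SNR and converts it to a spectrogram image; the sampling rate, duration, and bandwidth are illustrative assumptions, not AIMC-Spec's parameters.

```python
import numpy as np
from scipy.signal import spectrogram

def lfm_pulse_spectrogram(snr_db=0, fs=100e6, dur=10e-6, bw=20e6,
                          rng=np.random.default_rng()):
    """Generate one noisy LFM pulse (complex baseband I/Q) and its spectrogram."""
    t = np.arange(0, dur, 1 / fs)
    iq = np.exp(1j * np.pi * (bw / dur) * t**2)            # unit-power LFM sweep
    noise_power = 10 ** (-snr_db / 10)
    noise = np.sqrt(noise_power / 2) * (rng.standard_normal(t.size)
                                        + 1j * rng.standard_normal(t.size))
    f, tt, sxx = spectrogram(iq + noise, fs=fs, nperseg=64, return_onesided=False)
    return 10 * np.log10(np.abs(sxx) + 1e-12)              # dB image for the classifier

img = lfm_pulse_spectrogram(snr_db=-6)
```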
https://arxiv.org/abs/2601.08265
While Graph Neural Networks (GNNs) excel on graph-structured data, their performance is fundamentally limited by the quality of the observed graph, which often contains noise, missing links, or structural properties misaligned with GNNs' underlying assumptions. To address this, graph structure learning aims to infer a more optimal topology. Existing methods, however, often incur high computational costs due to complex generative models and iterative joint optimization, limiting their practical utility. In this paper, we propose GADPN, a simple yet effective graph structure learning framework that adaptively refines graph topology via low-rank denoising and generalized structural perturbation. Our approach makes two key contributions: (1) we introduce Bayesian optimization to adaptively determine the optimal denoising strength, tailoring the process to each graph's homophily level; and (2) we extend the structural perturbation method to arbitrary graphs via Singular Value Decomposition (SVD), overcoming its original limitation to symmetric structures. Extensive experiments on benchmark datasets demonstrate that GADPN achieves state-of-the-art performance while significantly improving efficiency. It shows particularly strong gains on challenging disassortative graphs, validating its ability to robustly learn enhanced graph structures across diverse network types.
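The SVD-based low-rank denoising step can be sketched directly; the fixed rank and threshold below stand in for the denoising strength that GADPN selects via Bayesian optimization.

```python
import numpy as np

def lowrank_denoise_adj(A, k=20, tau=0.1):
    """Denoise a (possibly asymmetric) adjacency matrix by rank-k SVD truncation,
    then sparsify; SVD handles arbitrary graphs, not just symmetric ones."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_lr = (U[:, :k] * s[:k]) @ Vt[:k]                 # best rank-k approximation
    return np.where(A_lr > tau, A_lr, 0.0)             # drop weak reconstructed edges

A = (np.random.rand(100, 100) < 0.05).astype(float)    # toy directed graph
A_clean = lowrank_denoise_adj(A)
```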
https://arxiv.org/abs/2601.08230
Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models suffer significantly from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimate of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.
https://arxiv.org/abs/2512.04559
Zero-Shot image Anomaly Detection (ZSAD) aims to detect and localise anomalies without access to any normal training samples of the target data. While recent ZSAD approaches leverage additional modalities such as language to generate fine-grained prompts for localisation, vision-only methods remain limited to image-level classification, lacking spatial precision. In this work, we introduce a simple yet effective training-free vision-only ZSAD framework that circumvents the need for fine-grained prompts by leveraging the inversion of a pretrained Denoising Diffusion Implicit Model (DDIM). Specifically, given an input image and a generic text description (e.g., "an image of an [object class]"), we invert the image to obtain latent representations and initiate the denoising process from a fixed intermediate timestep to reconstruct the image. Since the underlying diffusion model is trained solely on normal data, this process yields a normal-looking reconstruction. The discrepancy between the input image and the reconstructed one highlights potential anomalies. Our method achieves state-of-the-art performance on the VisA dataset, demonstrating strong localisation capabilities without auxiliary modalities and facilitating a shift away from prompt dependence for zero-shot anomaly detection research. Code is available at this https URL.
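A minimal sketch of the reconstruct-and-compare loop follows; `invert` and `denoise_from` are placeholders for DDIM inversion and partial re-denoising with a real diffusion pipeline, with toy stand-ins supplied so the code runs.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def anomaly_map(image, invert, denoise_from, t_mid=0.5):
    """Training-free anomaly localisation via partial diffusion reconstruction.
    `invert(image) -> latents` and `denoise_from(latents, t) -> image` are
    assumed interfaces for a model trained on normal data only."""
    latents = invert(image)                       # DDIM inversion to noise space
    recon = denoise_from(latents, t_mid)          # re-denoise from a mid timestep
    return np.abs(image - recon).mean(axis=-1)    # per-pixel discrepancy score

# Toy stand-ins: a "model" whose reconstruction is a smoothed (normal-looking)
# version of the input, so anomalous high-frequency detail shows up in the map.
img = np.random.rand(64, 64, 3)
amap = anomaly_map(img, invert=lambda x: x,
                   denoise_from=lambda z, t: gaussian_filter(z, (3, 3, 0)))
```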
https://arxiv.org/abs/2601.08022
Current text-conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight training-free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors. Cross-Attention Object Fusion (CAOF) first averages head-wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy-regularised optimal transport problem that reassigns complete multi-head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD-XL), preserving rich cross-head correlations while keeping memory low. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization. A lightweight one-dimensional Gaussian filter separates low- and high-frequency components; only the high-frequency residual is blended back, imprinting brush-stroke-level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context-aware texture modulation that remains independent of object fusion. Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
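The frequency-split style injection in SASF can be sketched on token features: blur with a 1D Gaussian, keep the content's low frequencies, and blend back only the style's high-frequency residual. Shapes and kernel size below are assumptions, not SD-XL specifics.

```python
import torch
import torch.nn.functional as F

def detail_sensitive_in(content, style, sigma=2.0, k=9):
    """Sketch of SASF-style injection on (batch, channels, tokens) features."""
    t = torch.arange(k, dtype=torch.float32) - k // 2
    g = torch.exp(-t**2 / (2 * sigma**2))
    g = (g / g.sum()).view(1, 1, k)
    # Depthwise 1D Gaussian blur as the low-pass filter.
    blur = lambda f: F.conv1d(f, g.repeat(f.size(1), 1, 1),
                              padding=k // 2, groups=f.size(1))
    # Per-channel instance normalization over the token axis.
    norm = lambda f: (f - f.mean(-1, keepdim=True)) / (f.std(-1, keepdim=True) + 1e-5)
    low = blur(norm(content))                    # content keeps global geometry
    high = norm(style) - blur(norm(style))       # style's high-frequency residual
    return low + high                            # texture imprint, geometry preserved

out = detail_sensitive_in(torch.randn(1, 640, 256), torch.randn(1, 640, 256))
```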
https://arxiv.org/abs/2601.08011
Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals its Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers utilize floating tokens to build a global structural framework, while deeper layers allocate more capability toward capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, allowing them to double the performance compared to ARMs in knowledge-intensive tasks. All codes and datasets are available at this https URL.
https://arxiv.org/abs/2601.07894
Large Audio Language Models (LALMs) have been widely applied in real-time scenarios, such as in-car assistants and online meeting comprehension. In practice, audio inputs are often corrupted by device and environmental noise, leading to performance degradation. However, existing LALM studies on noise lack quantitative analysis and rely mainly on intuition and empirical observation, thus failing to understand practical robustness. To address this issue, we introduce Signal Embedding Energy (SEE), a method for quantifying the impact of noise intensity on LALM inputs, enabling the differentiation of LALM robustness in real-world deployments. SEE introduces a perspective based on structured activation subspaces derived from the model's internal representations, which more accurately captures its perception of noise than raw audio features. Across experiments, SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. Surprisingly, traditional audio denoising methods are only marginally effective for LALMs, and, in some cases, even increase SEE and impair performance. This suggests a mismatch between speech-centric denoising objectives and the noise sensitivity of modern LALMs. Therefore, we propose a mitigation strategy derived from SEE to denoise LALM inputs, outperforming existing denoising methods. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.
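The abstract does not spell out SEE's construction; the following is a hedged sketch of one plausible reading, measuring the energy of test-time activations that falls outside a PCA subspace fitted on clean activations.

```python
import numpy as np

def signal_embedding_energy(acts_clean, acts_test, k=32):
    """SEE-like score (an assumed reading of the abstract): out-of-subspace
    energy of test activations relative to a clean structured subspace."""
    mu = acts_clean.mean(0)
    _, _, Vt = np.linalg.svd(acts_clean - mu, full_matrices=False)
    P = Vt[:k].T @ Vt[:k]                         # projector onto clean subspace
    resid = (acts_test - mu) - (acts_test - mu) @ P
    return np.mean(np.sum(resid**2, axis=1))      # mean out-of-subspace energy

see = signal_embedding_energy(np.random.randn(500, 256), np.random.randn(100, 256))
```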
https://arxiv.org/abs/2601.07331
The task of Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. This requires diffusion models to reconcile high-frequency visual constraints and low-frequency textual guidance during the denoising process. However, while existing I2V models prioritize visual consistency, how to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored. In this work, we observe that in Diffusion Transformer (DiT)-based I2V models, certain intermediate layers exhibit weak semantic responses (termed Semantic-Weak Layers), as indicated by a measurable drop in text-visual similarity. We attribute this to a phenomenon called Condition Isolation, where attention to visual features becomes partially detached from text guidance and overly relies on learned visual priors. To address this, we propose Focal Guidance (FG), which enhances the controllability from Semantic-Weak Layers. FG comprises two mechanisms: (1) Fine-grained Semantic Guidance (FSG) leverages CLIP to identify key regions in the reference frame and uses them as anchors to guide Semantic-Weak Layers. (2) Attention Cache transfers attention maps from semantically responsive layers to Semantic-Weak Layers, injecting explicit semantic signals and alleviating their over-reliance on the model's learned visual priors, thereby enhancing adherence to textual instructions. To further validate our approach and address the lack of evaluation in this direction, we introduce a benchmark for assessing instruction following in I2V models. On this benchmark, Focal Guidance proves its effectiveness and generalizability, raising the total score on Wan2.1-I2V to 0.7250 (+3.97%) and boosting the MMDiT-based HunyuanVideo-I2V to 0.5571 (+7.44%).
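The Attention Cache mechanism lends itself to a short sketch: blend attention maps from a semantically responsive layer into the Semantic-Weak Layers. The layer indices and blend weight here are assumptions; the paper identifies weak layers via the drop in text-visual similarity.

```python
import torch

def attention_cache_inject(attn_maps, weak_layers, source_layer, alpha=0.5):
    """Blend attention from a responsive layer into weak layers."""
    src = attn_maps[source_layer]
    for l in weak_layers:
        # A convex combination keeps each row a valid attention distribution.
        attn_maps[l] = (1 - alpha) * attn_maps[l] + alpha * src
    return attn_maps

# Toy maps: 12 layers of (heads, visual tokens, text tokens) cross-attention.
maps = {i: torch.softmax(torch.randn(8, 256, 77), dim=-1) for i in range(12)}
maps = attention_cache_inject(maps, weak_layers=[4, 5], source_layer=2)
```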
https://arxiv.org/abs/2601.07287
Stable Diffusion (SD) often produces degraded outputs when the training dataset contains adversarial noise. Adversarial purification offers a promising solution by removing adversarial noise from contaminated data. However, existing purification methods are primarily designed for classification tasks and fail to address SD-specific adversarial strategies, such as attacks targeting the VAE encoder, UNet denoiser, or both. To address the gap in SD security, we propose Universal Diffusion Adversarial Purification (UDAP), a novel framework tailored for defending adversarial attacks targeting SD models. UDAP leverages the distinct reconstruction behaviors of clean and adversarial images during Denoising Diffusion Implicit Models (DDIM) inversion to optimize the purification process. By minimizing the DDIM metric loss, UDAP can effectively remove adversarial noise. Additionally, we introduce a dynamic epoch adjustment strategy that adapts optimization iterations based on reconstruction errors, significantly improving efficiency without sacrificing purification quality. Experiments demonstrate UDAP's robustness against diverse adversarial methods, including PID (VAE-targeted), Anti-DreamBooth (UNet-targeted), MIST (hybrid), and robustness-enhanced variants like Anti-Diffusion (Anti-DF) and MetaCloak. UDAP also generalizes well across SD versions and text prompts, showcasing its practical applicability in real-world scenarios.
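A hedged sketch of the purification loop follows, including a simple stopping rule standing in for the dynamic epoch-adjustment strategy; `inversion_loss` is an assumed callable wrapping DDIM inversion and reconstruction of an actual SD model, replaced by a toy loss so the code runs.

```python
import torch

def purify(x_adv, inversion_loss, max_epochs=200, lr=0.01, tol=1e-3):
    """Optimize the image to minimize a DDIM reconstruction discrepancy,
    stopping early once the reconstruction error is small enough."""
    x = x_adv.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(max_epochs):
        loss = inversion_loss(x)
        if loss.item() < tol:                 # dynamic stopping rule (assumed form)
            break
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()

x = purify(torch.rand(1, 3, 64, 64),
           inversion_loss=lambda x: ((x - x.mean()) ** 2).mean())   # toy loss
```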
https://arxiv.org/abs/2601.07253
Reliable learning on low-quality multimodal data is a widespread concern, especially in safety-critical applications. However, multimodal noise poses a major challenge in this domain and leaves existing methods with two key limitations. First, they struggle to reliably remove heterogeneous data noise, hindering robust multimodal representation learning. Second, they exhibit limited adaptability and generalization when encountering previously unseen noise. To address these issues, we propose the Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD). On one hand, TAHCD introduces Adaptive Stable Subspace Alignment and Sample-Adaptive Confidence Alignment to reliably remove heterogeneous noise. These components account for noise at both the global and instance levels and enable joint removal of modality-specific and cross-modality noise, achieving robust learning. On the other hand, TAHCD introduces test-time cooperative enhancement, which adaptively updates the model in response to input noise in a label-free manner, improving adaptability and generalization. This is achieved by collaboratively enhancing the joint removal of modality-specific and cross-modality noise across the global and instance levels according to sample noise. Experiments on multiple benchmarks demonstrate that the proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
https://arxiv.org/abs/2601.07163
Low-dose Positron Emission Tomography (PET) imaging reduces patient radiation exposure but suffers from increased noise that degrades image quality and diagnostic reliability. Although diffusion models have demonstrated strong denoising capability, their stochastic nature makes it challenging to enforce anatomically consistent structures, particularly in low signal-to-noise regimes and volumetric whole-body imaging. We propose Wavelet-Conditioned ControlNet (WCC-Net), a fully 3D diffusion-based framework that introduces explicit frequency-domain structural priors via wavelet representations to guide volumetric PET denoising. By injecting wavelet-based structural guidance into a frozen pretrained diffusion backbone through a lightweight control branch, WCC-Net decouples anatomical structure from noise while preserving generative expressiveness and 3D structural continuity. Extensive experiments demonstrate that WCC-Net consistently outperforms CNN-, GAN-, and diffusion-based baselines. On the internal 1/20-dose test set, WCC-Net improves PSNR by +1.21 dB and SSIM by +0.008 over a strong diffusion baseline, while reducing structural distortion (GMSD) and intensity error (NMAE). Moreover, WCC-Net generalizes robustly to unseen dose levels (1/50 and 1/4), achieving superior quantitative performance and improved volumetric anatomical consistency.
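The wavelet-derived structural prior can be sketched with PyWavelets: decompose the volume once, upsample the subbands back to input resolution, and stack them as conditioning channels for the control branch; the wavelet choice and level are assumptions.

```python
import numpy as np
import pywt

def wavelet_condition(volume, wavelet="haar", level=1):
    """Build frequency-domain structural guidance from a PET volume via a
    single-level 3D wavelet transform (8 subbands stacked as channels)."""
    coeffs = pywt.wavedecn(volume, wavelet=wavelet, level=level)
    detail = coeffs[1]                                   # dict of 7 detail subbands
    bands = [coeffs[0]] + [detail[k] for k in sorted(detail)]
    # Nearest-neighbor upsample each half-resolution subband back to input size.
    up = [np.repeat(np.repeat(np.repeat(b, 2, 0), 2, 1), 2, 2) for b in bands]
    return np.stack([b[:volume.shape[0], :volume.shape[1], :volume.shape[2]]
                     for b in up])

cond = wavelet_condition(np.random.rand(32, 64, 64))     # -> (8, 32, 64, 64)
```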
https://arxiv.org/abs/2601.07093
Low-resolution image representation is a special form of sparse representation that retains only low-frequency information while discarding high-frequency components. This property reduces storage and transmission costs and benefits various image processing tasks. However, a key challenge is to preserve essential visual content while maintaining the ability to accurately reconstruct the original images. This work proposes LR2Flow, a nonlinear framework that learns low-resolution image representations by integrating wavelet tight frame blocks with normalizing flows. We conduct a reconstruction error analysis of the proposed network, which demonstrates the necessity of designing invertible neural networks in the wavelet tight frame domain. Experimental results on various tasks, including image rescaling, compression, and denoising, demonstrate the effectiveness of the learned representations and the robustness of the proposed framework.
https://arxiv.org/abs/2601.06834
In autonomous driving, Vision Language Models (VLMs) excel at high-level reasoning, whereas semantic occupancy provides fine-grained details. Despite significant progress in each field, there is still no method that effectively integrates both paradigms. Conventional VLMs struggle with token explosion and limited spatiotemporal reasoning, while semantic occupancy provides a unified, explicit spatial representation but is too dense to integrate efficiently with VLMs. To address these challenges and bridge the gap between VLMs and occupancy, we propose SparseOccVLA, a novel vision-language-action model that unifies scene understanding, occupancy forecasting, and trajectory planning powered by sparse occupancy queries. Starting with a lightweight Sparse Occupancy Encoder, SparseOccVLA generates compact yet highly informative sparse occupancy queries that serve as the single bridge between vision and language. These queries are aligned into the language space and reasoned over by the LLM for unified scene understanding and future occupancy forecasting. Furthermore, we introduce an LLM-guided Anchor-Diffusion Planner featuring decoupled anchor scoring and denoising, as well as cross-model trajectory-condition fusion. SparseOccVLA achieves a 7% relative improvement in CIDEr over the state of the art on OmniDrive-nuScenes, a 0.5-point increase in mIoU on Occ3D-nuScenes, and sets a new state-of-the-art open-loop planning result on the nuScenes benchmark, demonstrating its strong holistic capability.
https://arxiv.org/abs/2601.06474
In this paper, we introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos, and inpainting them with semantically consistent and temporally coherent content. Our approach leverages a pre-trained text-to-video diffusion transformer (DiT). Given an input video, a user-provided object mask, and query tokens describing the target object and its effects, we localize relevant visual tokens via visual-text cross-attention and visual self-attention. This produces an intermediate effect mask that we fuse with the user mask to obtain a final foreground token mask to replace. We first invert the video through the DiT to obtain structured noise, then reinitialize the masked tokens with Gaussian noise while preserving background tokens. During denoising, we copy values for the background tokens saved during inversion to maintain scene fidelity. To address the lack of suitable evaluation, we introduce a new object removal metric that rewards temporal consistency among foreground tokens across consecutive frames, coherence between foreground and background tokens within each frame, and dissimilarity between the input and output foreground tokens. Experiments on DAVIS and a newly curated real-world associated effect benchmark (WIPER-Bench) show that Object-WIPER surpasses both training-based and training-free baselines in terms of the metric, achieving clean removal and temporally stable reconstruction without any retraining. Our new benchmark, source code, and pre-trained models will be publicly available.
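The token bookkeeping during denoising reduces to a few lines; the sketch below assumes a `denoise_step` callable and a per-step cache of inverted latents, both hypothetical interfaces standing in for the real DiT pipeline.

```python
import torch

@torch.no_grad()
def masked_denoise(denoise_step, inversion_cache, mask, steps=50):
    """Masked (foreground) tokens start from Gaussian noise; at every step the
    background tokens are overwritten with values cached during inversion."""
    x = torch.where(mask, torch.randn_like(inversion_cache[steps - 1]),
                    inversion_cache[steps - 1])
    for t in reversed(range(steps)):
        x = denoise_step(x, t)                         # one DiT denoising step
        x = torch.where(mask, x, inversion_cache[t])   # restore background tokens
    return x

cache = [torch.randn(1, 1024, 64) for _ in range(50)]  # toy per-step latents
mask = torch.zeros(1, 1024, 1, dtype=torch.bool)
mask[:, :128] = True                                   # foreground token positions
out = masked_denoise(lambda x, t: x - 0.01 * x, cache, mask)
```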
https://arxiv.org/abs/2601.06391
Clinical trial amendments frequently introduce delays, increased costs, and administrative burden, with eligibility criteria being the most commonly amended component. We introduce eligibility criteria amendment prediction, a novel NLP task that aims to forecast whether the eligibility criteria of an initial trial protocol will undergo future amendments. To support this task, we release AMEND++, a benchmark suite comprising two datasets: AMEND, which captures eligibility-criteria version histories and amendment labels from public clinical trials, and AMEND_LLM, a refined subset curated using an LLM-based denoising pipeline to isolate substantive changes. We further propose Change-Aware Masked Language Modeling (CAMLM), a revision-aware pretraining strategy that leverages historical edits to learn amendment-sensitive representations. Experiments across diverse baselines show that CAMLM consistently improves amendment prediction, enabling more robust and cost-effective clinical trial design.
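A minimal sketch of change-aware masking, using difflib to locate edited spans between two protocol versions; the keep probability and token-level alignment are assumptions rather than the paper's exact recipe.

```python
import difflib
import random

def change_aware_mask(old_tokens, new_tokens, mask_token="[MASK]", p_keep=0.2):
    """Preferentially mask tokens at positions edited between two versions, so
    the MLM must predict amendment-relevant spans."""
    masked, targets = list(new_tokens), []
    sm = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    for op, _, _, j1, j2 in sm.get_opcodes():
        if op in ("replace", "insert"):           # spans touched by the edit
            for j in range(j1, j2):
                if random.random() > p_keep:      # leave a few edits visible
                    targets.append((j, new_tokens[j]))
                    masked[j] = mask_token
    return masked, targets          # train the MLM to recover `targets` from `masked`

old = "patients aged 18 to 65 with stable disease".split()
new = "patients aged 18 to 75 with stable measurable disease".split()
masked, targets = change_aware_mask(old, new)
```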
https://arxiv.org/abs/2601.06300