Recent advances in 3D generation have transitioned from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground truth data. Despite this progress, three key limitations persist: (1) single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) holistic latent coding neglects the part independence and interrelationships critical for compositional design; (3) global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart, a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: i) it reduces encoding complexity through part decomposition; ii) it enables explicit modeling of part relationships; iii) it supports part-level conditioning. We further develop a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part-latent denoising, ensuring geometric coherence while preserving foundation-model priors. To enable large-scale training, we construct Partverse, a novel 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart's superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.
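For intuition, here is a minimal sketch of what jointly denoising several contextual part latents with a mutual-guidance step could look like; the attention block, the update rule, and all shapes are illustrative assumptions, not CoPart's actual architecture.

```python
import torch
import torch.nn as nn

class MutualGuidanceBlock(nn.Module):
    """Toy cross-part attention: each part latent attends to all other parts.

    Hypothetical stand-in for a mutual-guidance mechanism; dimensions are made up.
    """
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, part_latents):           # (batch, num_parts, dim)
        out, _ = self.attn(part_latents, part_latents, part_latents)
        return part_latents + out              # residual update keeps each part's identity

def joint_denoise_step(part_latents, denoiser, guidance, t):
    """One reverse step applied jointly to all part latents."""
    guided = guidance(part_latents)            # exchange context across parts
    eps = denoiser(guided, t)                  # shared noise prediction
    return part_latents - 0.1 * eps            # schematic update, not a real scheduler

if __name__ == "__main__":
    B, P, D = 2, 5, 64                         # batch, parts, latent dim (assumed)
    latents = torch.randn(B, P, D)
    guidance = MutualGuidanceBlock(D)
    denoiser = lambda z, t: torch.tanh(z)      # dummy noise predictor
    for t in reversed(range(10)):
        latents = joint_denoise_step(latents, denoiser, guidance, t)
    print(latents.shape)                       # torch.Size([2, 5, 64])
```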
https://arxiv.org/abs/2507.08772
Diffusion transformers have emerged as an alternative to U-Net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, such as reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution latent denoising to efficiently capture global semantic structure, 2) region-adaptive upsampling of regions prone to artifacts at full resolution, and 3) all-latent upsampling at full resolution for detail refinement. To stabilize generation across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. Our method significantly reduces computation, achieving up to 7.0$\times$ speed-up on FLUX and 3.0$\times$ on Stable Diffusion 3 with minimal degradation in image quality. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, and can thus be seamlessly integrated to further reduce inference latency without compromising generation quality.
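A rough sketch of the three-stage, mixed-resolution sampling idea follows; the denoiser, the artifact heuristic, the re-noising factor, and the step counts are placeholders, and the paper's mixed-resolution bookkeeping is collapsed into a simple mask for brevity.

```python
import torch
import torch.nn.functional as F

def dummy_denoise(x, t):
    """Stand-in for a diffusion transformer's single denoising step."""
    return x - 0.05 * torch.randn_like(x)

def artifact_regions(x, keep_frac=0.25):
    """Pick the highest-variance latent locations as 'artifact-prone' (toy heuristic)."""
    var = x.var(dim=1, keepdim=True)                        # per-location channel variance
    thresh = torch.quantile(var.flatten(), 1.0 - keep_frac)
    return (var >= thresh).float()                          # binary mask, broadcast over channels

def ralu_like_sampling(latent_hw=64, channels=4, steps=(10, 5, 10)):
    # Stage 1: low-resolution denoising for global structure
    x = torch.randn(1, channels, latent_hw // 2, latent_hw // 2)
    for t in range(steps[0]):
        x = dummy_denoise(x, t)

    # Stage 2: upsample, then lightly re-noise only artifact-prone regions
    # (schematic version of noise-timestep rescheduling)
    x_full = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
    mask = artifact_regions(x_full)
    x_full = x_full + 0.1 * mask * torch.randn_like(x_full)
    for t in range(steps[1]):
        x_full = dummy_denoise(x_full, t)

    # Stage 3: all-latent refinement at full resolution
    for t in range(steps[2]):
        x_full = dummy_denoise(x_full, t)
    return x_full

print(ralu_like_sampling().shape)   # torch.Size([1, 4, 64, 64])
```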
https://arxiv.org/abs/2507.08422
Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse pose and layout. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is provided in this https URL.
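As a toy illustration of the identity-transport step, the sketch below matches target tokens to reference identity tokens with a discrete optimal-transport (Hungarian) assignment and blends them; the feature spaces, blending weight, and pose-awareness of the real IT stage are not modeled here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def transport_identity(ref_feats, tgt_feats):
    """Match each target token to a reference identity token and blend it in.

    Toy discrete optimal transport via the Hungarian algorithm; CoDi's actual
    transport plan and feature spaces are not specified in the abstract.
    """
    # cost = negative cosine similarity between token features
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    cost = -tgt @ ref.T                                    # (n_tgt, n_ref)
    rows, cols = linear_sum_assignment(cost)
    blended = tgt_feats.copy()
    blended[rows] = 0.5 * tgt_feats[rows] + 0.5 * ref_feats[cols]   # pose stays, identity moves
    return blended

ref = np.random.randn(16, 32)    # 16 reference identity tokens, dim 32 (assumed)
tgt = np.random.randn(16, 32)
print(transport_identity(ref, tgt).shape)   # (16, 32)
```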
https://arxiv.org/abs/2507.08396
Drag-based editing allows precise object manipulation through point-based control, offering user convenience. However, current methods often suffer from geometric inconsistency because they focus exclusively on matching user-defined points and neglect the broader geometry, leading to artifacts or unstable edits. We propose FlowDrag, which leverages geometric information for more accurate and coherent transformations. Our approach constructs a 3D mesh from the image and uses an energy function to guide mesh deformation based on user-defined drag points. The resulting mesh displacements are projected into 2D and incorporated into a UNet denoising process, enabling precise handle-to-target point alignment while preserving structural integrity. Additionally, existing drag-editing benchmarks provide no ground truth, making it difficult to assess how accurately the edits match the intended transformations. To address this, we present the VFD (VidFrameDrag) benchmark dataset, which provides ground-truth frames derived from consecutive shots in a video dataset. FlowDrag outperforms existing drag-based editing methods on both VFD Bench and DragBench.
https://arxiv.org/abs/2507.08285
We propose Adaptive Diffusion Denoised Smoothing, a method for certifying the predictions of a vision model against adversarial examples, while adapting to the input. Our key insight is to reinterpret a guided denoising diffusion model as a long sequence of adaptive Gaussian Differentially Private (GDP) mechanisms refining a pure noise sample into an image. We show that these adaptive mechanisms can be composed through a GDP privacy filter to analyze the end-to-end robustness of the guided denoising process, yielding a provable certification that extends the adaptive randomized smoothing analysis. We demonstrate that our design, under a specific guiding strategy, can improve both certified accuracy and standard accuracy on ImageNet for an $\ell_2$ threat model.
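The certificate the paper extends bottoms out in the standard randomized-smoothing radius; a small helper for that formula is shown below (the GDP composition and the adaptive privacy filter themselves are not reproduced, so this is only the final formula the analysis builds on).

```python
from scipy.stats import norm

def certified_radius(p_a: float, sigma: float) -> float:
    """Standard randomized-smoothing l2 radius (Cohen et al.): R = sigma * Phi^{-1}(p_A).

    p_a is a lower bound on the smoothed classifier's top-class probability
    under Gaussian noise of standard deviation sigma.
    """
    if p_a <= 0.5:
        return 0.0                      # no certificate when the top class is not a majority
    return sigma * norm.ppf(p_a)

# Example: top-class probability lower bound 0.9 under noise sigma = 0.5
print(round(certified_radius(0.9, 0.5), 4))   # ~0.6408
```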
https://arxiv.org/abs/2507.08163
We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL presents a unique challenge of stable value maximization. Unlike the simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead constructing an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies: a larger expressive base policy trained with a stable imitation learning objective, and a lightweight Gaussian edit policy that edits the actions sampled from the base policy toward a higher-value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value-maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to a 2-3x improvement in sample efficiency on average over prior methods, both when fine-tuning a pretrained policy given offline data and when leveraging offline data to train online.
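A compact sketch of the on-the-fly action selection described above might look as follows; the base policy, edit policy, and Q-function here are toy stand-ins, not EXPO's actual parameterizations.

```python
import numpy as np

def on_the_fly_action(obs, base_policy, edit_policy, q_fn, n_samples=8):
    """Sample from the expressive base policy, nudge samples with the Gaussian edit
    policy, then keep whichever candidate the Q-function prefers."""
    base_actions = np.stack([base_policy(obs) for _ in range(n_samples)])   # (N, act_dim)
    edited = base_actions + edit_policy(obs, base_actions)                  # Gaussian edits
    candidates = np.concatenate([base_actions, edited], axis=0)
    q_values = np.array([q_fn(obs, a) for a in candidates])
    return candidates[int(np.argmax(q_values))]

# toy placeholders
act_dim = 2
base_policy = lambda obs: np.random.randn(act_dim)                    # e.g. a diffusion-policy sample
edit_policy = lambda obs, a: 0.1 * np.random.randn(*a.shape)          # small Gaussian edits
q_fn = lambda obs, a: -np.sum((a - 0.3) ** 2)                         # pretend the optimum is near 0.3
print(on_the_fly_action(obs=np.zeros(4), base_policy=base_policy,
                        edit_policy=edit_policy, q_fn=q_fn))
```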
https://arxiv.org/abs/2507.07986
The remarkable results for denoising in computer vision using diffusion models given in \cite{SDWMG,HJA,HHG} yield a robust mathematical justification for algorithms based on crucial properties of a sequence of independent Gaussian $N(0,1)$ random variables. In particular, the derivations use the fact that a Gaussian distribution is determined by its mean and variance and that the sum of two Gaussians is another Gaussian. The issue raised in this short note is the following: suppose we use the algorithm without any changes but replace the nature of the noise, using, for instance, uniformly distributed noise, noise with a Beta distribution, or noise which is a random superposition of two Gaussians with very different variances. One could, of course, try to modify the algorithm keeping in mind the nature of the noise, but this is not what we do. Instead, we study the performance of the algorithm when used with noise that is very far in nature from the Gaussian case, where it is designed to work well. Usually these algorithms are implemented on very powerful computers; our experiments are all carried out on a small laptop and for the smallest possible image size. Exploring how our observations are confirmed or changed in different situations remains an interesting challenge.
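A minimal sketch of the kind of noise swap the note studies: the usual DDPM forward scaling is kept, but the noise law is replaced by standardized uniform, Beta, or Gaussian-mixture noise (only the Gaussian case matches the derivation the algorithms rest on).

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, alpha_bar, kind="gaussian"):
    """q(x_t | x_0) with the usual scaling but a swapped noise law.

    Non-Gaussian noises are standardized to zero mean and unit variance so the
    schedule stays comparable; this is the deliberate mismatch under study.
    """
    if kind == "gaussian":
        eps = rng.standard_normal(x0.shape)
    elif kind == "uniform":
        eps = rng.uniform(-np.sqrt(3), np.sqrt(3), size=x0.shape)        # variance 1
    elif kind == "beta":
        b = rng.beta(2.0, 2.0, size=x0.shape)                            # mean 0.5, var 0.05
        eps = (b - 0.5) / np.sqrt(0.05)
    elif kind == "gaussian_mixture":
        comp = rng.random(x0.shape) < 0.5
        eps = np.where(comp, rng.normal(0, 0.1, x0.shape), rng.normal(0, 1.4, x0.shape))
        eps = eps / eps.std()
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

x0 = rng.standard_normal((8, 8))
for kind in ["gaussian", "uniform", "beta", "gaussian_mixture"]:
    xt = forward_noise(x0, alpha_bar=0.5, kind=kind)
    print(kind, round(float(xt.std()), 3))
```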
https://arxiv.org/abs/2507.08059
The ever-growing volume of data in imaging sciences, stemming from advancements in imaging technologies, necessitates efficient and reliable storage solutions for such large datasets. This study investigates the compression of industrial X-ray computed tomography (XCT) data using deep learning autoencoders and examines how these compression algorithms affect the quality of the recovered data. Two network architectures with different compression rates were used: a deep convolutional neural network (D-CNN) and a vector-quantized variational autoencoder (VQ-VAE). The XCT data used was from a sandstone sample with a complex internal pore network. The quality of the decoded images obtained from the two architectures at different compression rates was quantified and compared to the original input data. In addition, to improve image decoding quality metrics, we introduced a metric sensitive to edge preservation, which is crucial for three-dimensional data analysis. We showed that different architectures and compression rates are required depending on the specific characteristics that need to be preserved for later analysis. The findings presented here can aid scientists in determining the requirements and strategies for their data storage and analysis needs.
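Since the paper's edge-sensitive metric is not specified in the abstract, the sketch below shows one plausible variant: correlating Sobel gradient magnitudes of the original and decoded slices. It is an illustrative stand-in, not the authors' formulation.

```python
import numpy as np
from scipy import ndimage

def edge_preservation_score(original, decoded):
    """Correlation between Sobel gradient magnitudes of original and decoded slices."""
    def grad_mag(img):
        gx = ndimage.sobel(img.astype(float), axis=0)
        gy = ndimage.sobel(img.astype(float), axis=1)
        return np.hypot(gx, gy)

    g1, g2 = grad_mag(original).ravel(), grad_mag(decoded).ravel()
    return float(np.corrcoef(g1, g2)[0, 1])     # 1.0 = edges perfectly preserved

# toy XCT-like slice: bright matrix with a dark circular 'pore'
yy, xx = np.mgrid[0:128, 0:128]
slice_ = np.ones((128, 128))
slice_[(yy - 40) ** 2 + (xx - 50) ** 2 < 150] = 0.0
blurred = ndimage.gaussian_filter(slice_, sigma=2)          # stand-in for lossy decoding
print(round(edge_preservation_score(slice_, blurred), 3))
```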
https://arxiv.org/abs/2507.07704
Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance degrades significantly in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model's ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance on weakly-supervised semantic segmentation tasks under low-light conditions. The source code has been released at this https URL.
https://arxiv.org/abs/2507.07578
While automated vehicles hold the potential to significantly reduce traffic accidents, their perception systems remain vulnerable to sensor degradation caused by adverse weather and environmental occlusions. Collective perception, which enables vehicles to share information, offers a promising approach to overcoming these limitations. However, to date, collective perception in adverse weather is mostly unstudied. Therefore, we conduct the first study of LiDAR-based collective perception under diverse weather conditions and propose DenoiseCP-Net, a novel multi-task architecture for LiDAR-based collective perception under adverse weather. Adverse weather conditions can not only degrade perception capabilities but also increase bandwidth requirements and latency, because the introduced noise is transmitted and processed as well; denoising prior to communication can effectively mitigate these issues. DenoiseCP-Net integrates voxel-level noise filtering and object detection into a unified sparse convolution backbone, eliminating the redundant computations associated with two-stage pipelines. This design not only reduces inference latency and computational cost but also minimizes communication overhead by removing non-informative noise. We extended the well-known OPV2V dataset by simulating rain, snow, and fog using our realistic weather simulation models. We demonstrate that DenoiseCP-Net achieves near-perfect denoising accuracy in adverse weather, reduces bandwidth requirements by up to 23.6% while maintaining the same detection accuracy, and lowers inference latency for cooperative vehicles.
https://arxiv.org/abs/2507.06976
Molecular structure elucidation from spectra is a foundational problem in chemistry, with profound implications for compound identification, synthesis, and drug development. Traditional methods rely heavily on expert interpretation and lack scalability. Pioneering machine learning methods have introduced retrieval-based strategies, but their reliance on finite libraries limits generalization to novel molecules. Generative models offer a promising alternative, yet most adopt autoregressive SMILES-based architectures that overlook 3D geometry and struggle to integrate diverse spectral modalities. In this work, we present DiffSpectra, a generative framework that directly infers both 2D and 3D molecular structures from multi-modal spectral data using diffusion models. DiffSpectra formulates structure elucidation as a conditional generation process. Its denoising network is parameterized by Diffusion Molecule Transformer, an SE(3)-equivariant architecture that integrates topological and geometric information. Conditioning is provided by SpecFormer, a transformer-based spectral encoder that captures intra- and inter-spectral dependencies from multi-modal spectra. Extensive experiments demonstrate that DiffSpectra achieves high accuracy in structure elucidation, recovering exact structures with 16.01% top-1 accuracy and 96.86% top-20 accuracy through sampling. The model benefits significantly from 3D geometric modeling, SpecFormer pre-training, and multi-modal conditioning. These results highlight the effectiveness of spectrum-conditioned diffusion modeling in addressing the challenge of molecular structure elucidation. To our knowledge, DiffSpectra is the first framework to unify multi-modal spectral reasoning and joint 2D/3D generative modeling for de novo molecular structure elucidation.
https://arxiv.org/abs/2507.06853
Image denoising is a fundamental task in computer vision, particularly in medical ultrasound (US) imaging, where speckle noise significantly degrades image quality. Although recent advancements in deep neural networks have led to substantial improvements in denoising for natural images, these methods cannot be directly applied to US speckle noise, as it is not purely random. Instead, US speckle arises from complex wave interference within the body microstructure, making it tissue-dependent. This dependency means that obtaining two independent noisy observations of the same scene, as required by the pioneering Noise2Noise method, is not feasible. Additionally, blind-spot networks cannot handle US speckle noise either, due to its high spatial dependency. To address this challenge, we introduce Speckle2Self, a novel self-supervised algorithm for speckle reduction using only single noisy observations. The key insight is that applying a multi-scale perturbation (MSP) operation introduces tissue-dependent variations in the speckle pattern across different scales, while preserving the shared anatomical structure. This enables effective speckle suppression by modeling the clean image as a low-rank signal and isolating the sparse noise component. To demonstrate its effectiveness, Speckle2Self is comprehensively compared with conventional filter-based denoising algorithms and state-of-the-art learning-based methods, using both realistic simulated US images and human carotid US images. Additionally, data from multiple US machines are employed to evaluate model generalization and adaptability to images from unseen domains. Code and datasets will be released upon acceptance.
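A schematic version of the multi-scale perturbation (MSP) idea, implemented here as naive down/up-sampling of a single noisy frame so that the speckle decorrelates across scales while the anatomy survives; the scales and interpolation mode are arbitrary choices, not the authors'.

```python
import torch
import torch.nn.functional as F

def multi_scale_perturbation(img, scales=(0.5, 0.75, 0.9)):
    """Create several views of one noisy ultrasound frame by down/up-sampling.

    The resampling perturbs the tissue-dependent speckle differently at each
    scale while keeping the shared anatomical structure (schematic MSP).
    """
    views = []
    for s in scales:
        low = F.interpolate(img, scale_factor=s, mode="bilinear", align_corners=False)
        back = F.interpolate(low, size=img.shape[-2:], mode="bilinear", align_corners=False)
        views.append(back)
    return torch.stack(views, dim=0)          # (num_scales, B, C, H, W)

noisy = torch.rand(1, 1, 128, 128)            # a single noisy observation
views = multi_scale_perturbation(noisy)
print(views.shape)                            # torch.Size([3, 1, 1, 128, 128])
```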
https://arxiv.org/abs/2507.06828
Diffusion models have shown remarkable promise for image restoration by leveraging powerful priors. Prominent methods typically frame the restoration problem within a Bayesian inference framework, which iteratively combines a denoising step with a likelihood guidance step. However, the interactions between these two components in the generation process remain underexplored. In this paper, we analyze the underlying gradient dynamics of these components and identify significant instabilities. Specifically, we demonstrate conflicts between the prior and likelihood gradient directions, alongside temporal fluctuations in the likelihood gradient itself. We show that these instabilities disrupt the generative process and compromise restoration performance. To address these issues, we propose Stabilized Progressive Gradient Diffusion (SPGD), a novel gradient management technique. SPGD integrates two synergistic components: (1) a progressive likelihood warm-up strategy to mitigate gradient conflicts; and (2) adaptive directional momentum (ADM) smoothing to reduce fluctuations in the likelihood gradient. Extensive experiments across diverse restoration tasks demonstrate that SPGD significantly enhances generation stability, leading to state-of-the-art performance in quantitative metrics and visually superior results. Code is available at this https URL.
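The two gradient-management ideas named in the abstract can be sketched in a few lines; the warm-up schedule, momentum coefficient, and step size below are guesses for illustration, not the paper's settings.

```python
import torch

def spgd_like_update(x, prior_grad, lik_grad, t, T, momentum_state, beta=0.9, lr=0.05):
    """Combine prior and likelihood gradients with a warm-up weight and momentum.

    Schematic rendering of 'progressive likelihood warm-up' and 'directional
    momentum smoothing'; all coefficients are assumptions.
    """
    warmup = 1.0 - t / T                                   # likelihood weight grows as t -> 0
    momentum_state = beta * momentum_state + (1 - beta) * lik_grad
    smoothed_lik = momentum_state / (momentum_state.norm() + 1e-8) * lik_grad.norm()
    x_next = x - lr * (prior_grad + warmup * smoothed_lik)
    return x_next, momentum_state

x = torch.randn(4)
m = torch.zeros(4)
T = 50
for t in reversed(range(T)):
    prior_g = x                                            # toy prior pulls toward zero
    lik_g = x - torch.tensor([1.0, 0.0, -1.0, 0.5])        # toy likelihood target
    x, m = spgd_like_update(x, prior_g, lik_g, t, T, m)
print(x)
```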
https://arxiv.org/abs/2507.06656
Disentangled and interpretable latent representations in generative models typically come at the cost of generation quality. The $\beta$-VAE framework introduces a hyperparameter $\beta$ to balance disentanglement and reconstruction quality, where setting $\beta > 1$ introduces an information bottleneck that favors disentanglement over sharp, accurate reconstructions. To address this trade-off, we propose a novel generative modeling framework that leverages a range of $\beta$ values to learn multiple corresponding latent representations. First, we obtain a slew of representations by training a single variational autoencoder (VAE) with a new loss function that controls the information retained in each latent representation, such that higher $\beta$ values prioritize disentanglement over reconstruction fidelity. We then introduce a non-linear diffusion model that smoothly transitions between latent representations corresponding to different $\beta$ values. This model denoises towards less disentangled and more informative representations, ultimately leading to (almost) lossless representations and enabling sharp reconstructions. Furthermore, our model supports sample generation without input images, functioning as a standalone generative model. We evaluate our framework in terms of both disentanglement and generation quality. Additionally, we observe smooth transitions in the latent spaces with respect to changes in $\beta$, facilitating consistent manipulation of generated outputs.
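One plausible reading of the multi-$\beta$ training objective is a sum of per-$\beta$ ELBO-style terms over several latent heads, as sketched below; this is an assumption about the loss, not the authors' exact formulation.

```python
import torch

def multi_beta_vae_loss(recons, mus, logvars, x, betas=(1.0, 4.0, 16.0)):
    """Sum of beta-VAE objectives over several beta values, one latent head per beta.

    `recons`, `mus`, `logvars` are lists aligned with `betas`; higher beta terms
    keep less information (more disentangled), lower beta terms keep more.
    """
    total = 0.0
    for recon, m, lv, beta in zip(recons, mus, logvars, betas):
        recon_term = torch.mean((recon - x) ** 2)
        kl_term = -0.5 * torch.mean(1 + lv - m ** 2 - lv.exp())
        total = total + recon_term + beta * kl_term
    return total

x = torch.rand(8, 3, 32, 32)
recons = [x + 0.1 * torch.randn_like(x) for _ in range(3)]
mus = [torch.randn(8, 16) for _ in range(3)]
logvars = [torch.zeros(8, 16) for _ in range(3)]
print(multi_beta_vae_loss(recons, mus, logvars, x))
```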
https://arxiv.org/abs/2507.06613
Text-to-image diffusion models (T2I DMs), represented by Stable Diffusion, generate highly realistic images from textual input and have been widely used. However, their misuse poses serious security risks. While existing concept unlearning methods aim to mitigate these risks, they struggle to balance unlearning effectiveness with generative capability. To overcome this limitation, we propose the Key Step Concept Unlearning (KSCU) method, which capitalizes on the stepwise sampling characteristic inherent in diffusion models during the image generation process. Unlike conventional approaches that treat all denoising steps equally, KSCU strategically focuses on the pivotal steps with the most influence over the final outcome, dividing key steps for different concept unlearning tasks and fine-tuning the model only at those steps. This targeted approach reduces the number of parameter updates needed for effective unlearning while maximizing the retention of the model's generative capability. Through extensive benchmark experiments, we demonstrate that KSCU effectively prevents T2I DMs from generating undesirable images while better retaining the model's generative capability. The code will be released.
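A hedged sketch of what restricting fine-tuning to key denoising steps could look like; the choice of key steps, the "pull toward a neutral prompt" target, and every component below are illustrative placeholders rather than KSCU's actual objective.

```python
import torch

def key_step_unlearning_loss(model, x0, concept_cond, neutral_cond, key_steps, alphas_bar):
    """Fine-tuning loss evaluated only at key denoising steps for one concept.

    Key idea from the abstract: update the model only at pivotal timesteps;
    the neutral-prompt target is an illustrative choice for the unlearning signal.
    """
    t = key_steps[torch.randint(len(key_steps), (1,)).item()]   # sample a key step only
    alpha_bar = alphas_bar[t]
    noise = torch.randn_like(x0)
    xt = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    pred_concept = model(xt, t, concept_cond)      # behaviour to unlearn
    with torch.no_grad():
        target = model(xt, t, neutral_cond)        # behaviour to keep
    return torch.mean((pred_concept - target) ** 2)

# toy setup
alphas_bar = torch.linspace(0.99, 0.01, 1000)
model = lambda xt, t, c: 0.1 * xt if c is None else 0.2 * xt   # dummy noise predictor
x0 = torch.randn(1, 4, 8, 8)
loss = key_step_unlearning_loss(model, x0, concept_cond="concept to unlearn",
                                neutral_cond=None, key_steps=[700, 750, 800],
                                alphas_bar=alphas_bar)
print(loss)
```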
https://arxiv.org/abs/2507.06526
Reconstructing ocean dynamics from observational data is fundamentally limited by the sparse, irregular, and Lagrangian nature of spatial sampling, particularly in subsurface and remote regions. This sparsity poses significant challenges for forecasting key phenomena such as eddy shedding and rogue waves. Traditional data assimilation methods and deep learning models often struggle to recover mesoscale turbulence under such constraints. We leverage a deep learning framework that combines neural operators with denoising diffusion probabilistic models (DDPMs) to reconstruct high-resolution ocean states from extremely sparse Lagrangian observations. By conditioning the generative model on neural operator outputs, the framework accurately captures small-scale, high-wavenumber dynamics even at $99\%$ sparsity (for synthetic data) and $99.9\%$ sparsity (for real satellite observations). We validate our method on benchmark systems, synthetic float observations, and real satellite data, demonstrating robust performance under severe spatial sampling limitations as compared to other deep learning baselines.
https://arxiv.org/abs/2507.06479
Low-dose computed tomography (LDCT) reduces radiation exposure but often degrades image quality, potentially compromising diagnostic accuracy. Existing deep learning-based denoising methods focus primarily on pixel-level mappings, overlooking the potential benefits of high-level semantic guidance. Recent advances in vision-language models (VLMs) suggest that language can serve as a powerful tool for capturing structured semantic information, offering new opportunities to improve LDCT reconstruction. In this paper, we introduce LangMamba, a language-driven Mamba framework for LDCT denoising that leverages VLM-derived representations to enhance supervision from normal-dose CT (NDCT). LangMamba follows a two-stage learning strategy. First, we pre-train a Language-guided AutoEncoder (LangAE) that leverages frozen VLMs to map NDCT images into a semantic space enriched with anatomical information. Second, we synergize LangAE with two key components to guide LDCT denoising: a Semantic-Enhanced Efficient Denoiser (SEED), which enhances NDCT-relevant local semantics while capturing global features with an efficient Mamba mechanism, and a Language-engaged Dual-space Alignment (LangDA) loss, which ensures that denoised images align with NDCT in both the perceptual and semantic spaces. Extensive experiments on two public datasets demonstrate that LangMamba outperforms conventional state-of-the-art methods, significantly improving detail preservation and visual fidelity. Remarkably, LangAE exhibits strong generalizability to unseen datasets, thereby reducing training costs. Furthermore, the LangDA loss improves explainability by integrating language-guided insights into image reconstruction and can be used in a plug-and-play fashion. Our findings shed new light on the potential of language as a supervisory signal to advance LDCT denoising. The code is publicly available at this https URL.
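The LangDA idea of aligning denoised and NDCT images in two spaces can be sketched as a two-term loss; the encoders and weights below are toy stand-ins for the perceptual and VLM-derived semantic spaces, not the paper's modules.

```python
import torch
import torch.nn.functional as F

def dual_space_alignment_loss(denoised, ndct, perceptual_encoder, semantic_encoder,
                              w_perc=1.0, w_sem=1.0):
    """Align the denoised CT with the NDCT target in two spaces at once."""
    perc = torch.mean((perceptual_encoder(denoised) - perceptual_encoder(ndct)) ** 2)
    sem = torch.mean((semantic_encoder(denoised) - semantic_encoder(ndct)) ** 2)
    return w_perc * perc + w_sem * sem

# toy encoders: average pooling as 'perceptual', a fixed random projection as 'semantic'
perc_enc = lambda x: F.avg_pool2d(x, 4)
proj = torch.randn(64, 16)
sem_enc = lambda x: x.flatten(1) @ proj
denoised, ndct = torch.rand(2, 1, 8, 8), torch.rand(2, 1, 8, 8)
print(dual_space_alignment_loss(denoised, ndct, perc_enc, sem_enc))
```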
https://arxiv.org/abs/2507.06140
Despite the success of deep learning across various domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on $\ell_{p}$-norm perturbation constraints, which do not align with human perceptual capabilities. Consequently, researchers have shifted their focus toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality due to instability and mode collapse. Meanwhile, diffusion models have been employed for UAE generation, but they still rely on iterative PGD perturbation injection, without fully leveraging their central denoising capabilities. In this paper, we introduce a novel approach for generating UAEs based on diffusion models, named ScoreAdv. This method incorporates an interpretable adversarial guidance mechanism to gradually shift the sampling distribution towards the adversarial distribution, while using an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, our method is capable of generating an unlimited number of natural adversarial examples and can attack not only classification models but also retrieval models. We conduct extensive experiments on ImageNet and CelebA datasets, validating the performance of ScoreAdv across ten target models in both black-box and white-box settings. Our results demonstrate that ScoreAdv achieves state-of-the-art attack success rates and image quality. Furthermore, the dynamic balance between denoising and adversarial perturbation enables ScoreAdv to remain robust even under defensive measures.
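A schematic classifier-guided denoising step in the spirit of the abstract is sketched below; the saliency-based injection of a reference image is omitted, and the denoiser, classifier, and guidance scale are toy assumptions.

```python
import torch

def adversarial_guided_step(x, t, denoiser, classifier, target_label, guidance_scale=1.0):
    """One denoising step nudged toward an adversarial target class."""
    x = x.detach().requires_grad_(True)
    logits = classifier(x)
    adv_loss = torch.nn.functional.cross_entropy(logits, target_label)
    grad = torch.autograd.grad(adv_loss, x)[0]
    with torch.no_grad():
        x_denoised = denoiser(x, t)                         # usual reverse-diffusion update
        return x_denoised - guidance_scale * grad           # shift toward the target class

# toy components
denoiser = lambda x, t: 0.9 * x
classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x = torch.randn(1, 3, 8, 8)
for t in reversed(range(5)):
    x = adversarial_guided_step(x, t, denoiser, classifier, torch.tensor([3]))
print(x.shape)
```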
https://arxiv.org/abs/2507.06078
Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing object relative positions, which results in suboptimal relation prediction. Motivated by the ability of the denoising diffusion model's inversion process to preserve the spatial structure of input images, we propose the SPADE (SPatial-Aware Denoising-nEtwork) framework, a novel approach for open-vocabulary PSG. SPADE consists of two key steps: (1) inversion-guided calibration for the UNet adaptation, and (2) spatial-aware context reasoning. In the first step, we calibrate a general pre-trained teacher diffusion model into a PSG-specific denoising network, using cross-attention maps derived during inversion and a lightweight LoRA-based fine-tuning strategy. In the second step, we develop a spatial-aware relation graph transformer that captures both local and long-range contextual information, facilitating the generation of high-quality relation queries. Extensive experiments on benchmark PSG and Visual Genome datasets demonstrate that SPADE outperforms state-of-the-art methods in both closed- and open-set scenarios, particularly for spatial relationship prediction.
https://arxiv.org/abs/2507.05798
Diffusion policies have become increasingly popular in robot learning due to their reliable convergence in motion generation tasks. At a high level, these policies learn to transform noisy action trajectories into effective ones, conditioned on observations. However, each time such a model is trained in a robotics context, the network must relearn fundamental spatial representations and operations, such as translations and rotations, from scratch in order to ground itself and operate effectively in a 3D environment. Incorporating geometric inductive biases directly into the network can alleviate this redundancy and substantially improve training efficiency. In this paper, we introduce hPGA-DP, a diffusion policy approach that integrates a mathematical framework called Projective Geometric Algebra (PGA) to embed strong geometric inductive biases. PGA is particularly well-suited for this purpose as it provides a unified algebraic framework that naturally encodes geometric primitives, such as points, directions, and rotations, enabling neural networks to reason about spatial structure through interpretable and composable operations. Specifically, we propose a novel diffusion policy architecture that incorporates the Projective Geometric Algebra Transformer (P-GATr), leveraging its E(3)-equivariant properties established in prior work. Our approach adopts a hybrid architecture strategy, using P-GATr as both a state encoder and action decoder, while employing U-Net or Transformer-based modules for the denoising process. Several experiments and ablation studies in both simulated and real-world environments demonstrate that hPGA-DP not only improves task performance and training efficiency through the geometric bias of P-GATr, but also achieves substantially faster convergence through its hybrid model compared to architectures that rely solely on P-GATr.
https://arxiv.org/abs/2507.05695