Pose-Guided Person Image Synthesis (PGPIS) generates images that maintain a subject's identity from a source image while adopting a specified target pose (e.g., a skeleton). While diffusion-based PGPIS methods effectively preserve facial features during pose transformation, they often struggle to accurately maintain clothing details from the source image throughout the diffusion process. This limitation becomes particularly problematic when there is a substantial difference between the source and target poses, significantly impacting PGPIS applications in the fashion industry, where clothing style preservation is crucial for copyright protection. Our analysis reveals that this limitation primarily stems from the conditional diffusion model's attention modules failing to adequately capture and preserve clothing patterns. To address this limitation, we propose human-parsing-guided attention diffusion, a novel approach that effectively preserves both facial and clothing appearance while generating high-quality results. The approach builds on a human-parsing-aware Siamese network that consists of three key components: dual identical UNets (TargetNet for diffusion denoising and SourceNet for source image embedding extraction), a human-parsing-guided fusion attention (HPFA) module, and a CLIP-guided attention alignment (CAA) module. The HPFA and CAA modules embed face and clothing patterns into the target image generation adaptively and effectively. Extensive experiments on both the in-shop clothes retrieval benchmark and the latest in-the-wild human editing dataset demonstrate our method's significant advantages over 13 baseline approaches in preserving both the facial and clothing appearance of the source image.
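The abstract does not spell out the HPFA mechanism, but the general idea of parsing-guided attention can be sketched as cross-attention restricted to matching parsing regions (face to face, clothes to clothes). Everything below — shapes, names, and the masking rule — is an illustrative assumption, not the authors' implementation:

```python
# Hypothetical sketch of parsing-masked cross-attention between TargetNet
# queries and SourceNet keys/values; all shapes and names are assumptions.
import torch

def parsing_masked_attention(q, k, v, q_labels, k_labels):
    """Cross-attend target queries to source keys, but only within
    matching human-parsing regions (e.g., face->face, clothes->clothes).

    q: (B, Nq, C) target-UNet features; k/v: (B, Nk, C) source-UNet features
    q_labels/k_labels: (B, Nq)/(B, Nk) integer parsing labels per token
    """
    scale = q.shape[-1] ** -0.5
    attn = torch.einsum("bqc,bkc->bqk", q, k) * scale
    same_region = q_labels.unsqueeze(-1) == k_labels.unsqueeze(1)  # (B, Nq, Nk)
    attn = attn.masked_fill(~same_region, float("-inf"))
    attn = attn.softmax(dim=-1)
    # Tokens whose region has no match attend to nothing; zero them safely.
    attn = torch.nan_to_num(attn)
    return torch.einsum("bqk,bkc->bqc", attn, v)

# Toy usage: 2 regions (0 = face, 1 = clothes) over 8 source / 8 target tokens.
B, N, C = 1, 8, 32
q, k, v = torch.randn(B, N, C), torch.randn(B, N, C), torch.randn(B, N, C)
labels = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
out = parsing_masked_attention(q, k, v, labels, labels)
print(out.shape)  # torch.Size([1, 8, 32])
```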
https://arxiv.org/abs/2502.03426
Nine-degrees-of-freedom (9-DoF) object pose and size estimation is crucial for enabling augmented reality and robotic manipulation. Category-level methods have received extensive research attention due to their potential for generalization to intra-class unknown objects. However, these methods require manual collection and labeling of large-scale real-world training data. To address this problem, we introduce a diffusion-based paradigm for domain-generalized category-level 9-DoF object pose estimation. Our motivation is to leverage the latent generalization ability of the diffusion model to address the domain generalization challenge in object pose estimation. This entails training the model exclusively on rendered synthetic data to achieve generalization to real-world scenes. We propose an effective diffusion model to redefine 9-DoF object pose estimation from a generative perspective. Our model does not require any 3D shape priors during training or inference. By employing the Denoising Diffusion Implicit Model, we demonstrate that the reverse diffusion process can be executed in as few as 3 steps, achieving near real-time performance. Finally, we design a robotic grasping system comprising both hardware and software components. Through comprehensive experiments on two benchmark datasets and the real-world robotic system, we show that our method achieves state-of-the-art domain generalization performance. Our code will be made public at this https URL.
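To make the few-step claim concrete, here is a minimal sketch of deterministic DDIM sampling over a 9-D pose vector (rotation, translation, size). The schedule and denoiser are toy placeholders, not the paper's network:

```python
# Minimal DDIM sketch: the reverse process over a 9-D pose vector can run
# in as few as 3 steps because the eta=0 update is deterministic.
import numpy as np

def ddim_sample(eps_model, alpha_bar, steps=(999, 666, 333), dim=9, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)                         # start from pure noise
    ts = list(steps) + [0]
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = eps_model(x, t)                            # predicted noise
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        x0 = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)       # clean-pose estimate
        x = np.sqrt(a_prev) * x0 + np.sqrt(1 - a_prev) * eps   # eta = 0 update
    return x

# Toy schedule and a dummy denoiser standing in for the conditional network.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
dummy_eps = lambda x, t: 0.1 * x                         # placeholder denoiser
pose = ddim_sample(dummy_eps, alpha_bar, steps=(999, 666, 333))
print(pose.shape)  # (9,)
```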
https://arxiv.org/abs/2502.02525
Current face editing methods mainly rely on GAN-based techniques, but recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in manipulating fine-grained attributes and preserving the consistency of attributes that should remain unchanged. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion models and crude 3D face models to control the lighting, facial expression, and head pose of a portrait photo. We observe that this task essentially involves combinations of target background, identity, and different face attributes. We aim to sufficiently disentangle the control of these factors to enable high-quality face editing. Specifically, our method, coined RigFace, contains: 1) a Spatial Attribute Encoder that provides precise and decoupled conditions of background, pose, expression, and lighting; 2) an Identity Encoder that transfers identity features to the denoising UNet of a pre-trained Stable-Diffusion model; and 3) an Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models.
https://arxiv.org/abs/2502.02465
Despite the groundbreaking success of diffusion models in generating high-fidelity images, their latent space remains relatively under-explored, even though it holds significant promise for enabling versatile and interpretable image editing capabilities. The complicated denoising trajectory and high dimensionality of the latent space make it extremely challenging to interpret. Existing methods mainly explore the feature space of the U-Net in Diffusion Models (DMs) instead of the latent space itself. In contrast, we directly investigate the latent space via Singular Value Decomposition (SVD) and discover three useful properties that can be used to control generation results without requiring data collection, while maintaining the identity fidelity of generated images. Based on these properties, we propose a novel image editing framework that is capable of learning arbitrary attributes from one pair of latent codes specified by text prompts in Stable Diffusion Models. To validate our approach, extensive experiments are conducted to demonstrate its effectiveness and flexibility in image editing. We will release our code soon to foster further research and applications in this area.
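As a rough illustration of the SVD-based recipe, the sketch below factors the difference between two latent codes (produced by contrasting prompts) and reuses the leading singular directions as an edit. The shapes and the editing rule are assumptions for illustration only:

```python
# Hedged sketch: estimate a low-rank edit direction from one pair of latents
# (e.g., "a young face" vs. "an old face") via SVD of their difference.
import numpy as np

def edit_direction_from_pair(z_a, z_b, rank=4):
    """z_a, z_b: (H, W) latent matrices; returns a rank-r edit direction
    estimated from their difference."""
    u, s, vt = np.linalg.svd(z_b - z_a, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]      # rank-r approximation

rng = np.random.default_rng(0)
z_src = rng.standard_normal((64, 64))                # latent from prompt A
z_dst = z_src + 0.3 * rng.standard_normal((64, 64))  # latent from prompt B
direction = edit_direction_from_pair(z_src, z_dst, rank=4)
z_edited = z_src + 0.8 * direction                   # strength 0.8, tunable
print(np.linalg.norm(z_edited - z_src))
```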
https://arxiv.org/abs/2502.02225
Diffusion priors have been used for blind face restoration (BFR) by fine-tuning diffusion models (DMs) on restoration datasets to recover low-quality images. However, the naive application of DMs presents several key limitations: (i) the diffusion prior has inferior semantic consistency (e.g., identity, structure, and color), increasing the difficulty of optimizing the BFR model; and (ii) it relies on hundreds of denoising iterations, preventing effective cooperation with perceptual losses, which is crucial for faithful restoration. Observing that the latent consistency model (LCM) learns consistent noise-to-data mappings on the ODE trajectory and therefore shows more semantic consistency in subject identity, structural information, and color preservation, we propose InterLCM, which leverages the LCM's superior semantic consistency and efficiency to counter the above issues. Treating the low-quality image as an intermediate state of the LCM, InterLCM achieves a balance between fidelity and quality by starting from earlier LCM steps. The LCM also allows the integration of perceptual loss during training, leading to improved restoration quality, particularly in real-world scenarios. To mitigate structural and semantic uncertainties, InterLCM incorporates a Visual Module to extract visual features and a Spatial Encoder to capture spatial details, enhancing the fidelity of restored images. Extensive experiments demonstrate that InterLCM outperforms existing approaches on both synthetic and real-world datasets while also achieving faster inference speed.
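The core trick — treating the low-quality image as an intermediate LCM state — can be sketched as follows. The consistency model and noise schedule are placeholders, and the starting step is exactly the fidelity/quality knob the abstract refers to:

```python
# Sketch: noise the low-quality latent to an early timestep and denoise from
# there, instead of starting the sampler from pure noise.
import numpy as np

def restore_from_intermediate(lq_latent, consistency_fn, alpha_bar,
                              t_start=400, extra_ts=(200,), seed=0):
    rng = np.random.default_rng(seed)
    a = alpha_bar[t_start]
    # Forward-noise the LQ latent to t_start (larger t_start = more freedom
    # to hallucinate detail, smaller = higher fidelity to the input).
    x = np.sqrt(a) * lq_latent + np.sqrt(1 - a) * rng.standard_normal(lq_latent.shape)
    x0 = consistency_fn(x, t_start)           # one-shot jump to a data estimate
    for t in extra_ts:                        # optional multistep refinement
        a = alpha_bar[t]
        x = np.sqrt(a) * x0 + np.sqrt(1 - a) * rng.standard_normal(x0.shape)
        x0 = consistency_fn(x, t)
    return x0

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
dummy_cm = lambda x, t: x * 0.9               # placeholder consistency model
out = restore_from_intermediate(np.zeros((4, 32, 32)), dummy_cm, alpha_bar)
print(out.shape)  # (4, 32, 32)
```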
https://arxiv.org/abs/2502.02215
In this work, we explore how instance-level memorization in the teacher Neural Machine Translation (NMT) model gets inherited by the student model in sequence-level knowledge distillation (SeqKD). We find that despite not directly seeing the original training data, students memorize more than baseline models (models of the same size, trained on the original data) -- 3.4% for exact matches and 57% for extractive memorization -- and show increased hallucination rates. Further, under this SeqKD setting, we also characterize how students behave on specific training data subgroups, such as subgroups with low quality and specific counterfactual memorization (CM) scores, and find that students exhibit amplified denoising on low-quality subgroups. Finally, we propose a modification to SeqKD named Adaptive-SeqKD, which intervenes in SeqKD to reduce memorization and hallucinations. Overall, we recommend caution when applying SeqKD: students inherit both their teachers' superior performance and their fault modes, thereby requiring active monitoring.
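As simplified proxies for the two quantities measured above, one can compute an exact-match rate and an n-gram-overlap stand-in for extractive memorization; the paper's actual definitions are more involved, so this is only meant to make the quantities concrete:

```python
# Simplified proxies: exact-match rate (generation identical to the training
# reference) and an n-gram-overlap proxy for extractive memorization.
def exact_match_rate(outputs, references):
    hits = sum(o.strip() == r.strip() for o, r in zip(outputs, references))
    return hits / len(outputs)

def extractive_rate(outputs, references, n=8):
    """Flag an output if it reproduces any n-gram from its reference."""
    def ngrams(text, n):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    hits = sum(bool(ngrams(o, n) & ngrams(r, n))
               for o, r in zip(outputs, references))
    return hits / len(outputs)

outs = ["the cat sat on the mat near the door", "hello world"]
refs = ["the cat sat on the mat near the door", "goodbye world"]
print(exact_match_rate(outs, refs), extractive_rate(outs, refs, n=4))
```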
https://arxiv.org/abs/2502.01491
We present a novel generative approach based on Denoising Diffusion Models (DDMs), which produces high-quality image samples along with their losslessly compressed bit-stream representations. This is obtained by replacing the standard Gaussian noise sampling in the reverse diffusion with a selection of noise samples from pre-defined codebooks of fixed iid Gaussian vectors. Surprisingly, we find that our method, termed Denoising Diffusion Codebook Model (DDCM), retains the sample quality and diversity of standard DDMs, even for extremely small codebooks. We leverage DDCM and pick the noises from the codebooks that best match a given image, converting our generative model into a highly effective lossy image codec achieving state-of-the-art perceptual image compression results. More generally, by setting other noise selection rules, we extend our compression method to any conditional image generation task (e.g., image restoration), where the generated images are produced jointly with their condensed bit-stream representations. Our work is accompanied by a mathematical interpretation of the proposed compressed conditional generation schemes, establishing a connection with score-based approximations of posterior samplers for the tasks considered.
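The codebook mechanism is concrete enough to sketch: at each reverse step, pick the codebook noise that best steers the sample toward the target, and record its index as the bit-stream. The scoring and update rule below are simplified stand-ins for the paper's procedure:

```python
# Sketch of the DDCM idea as the abstract describes it: replace fresh Gaussian
# draws with a selection from fixed iid Gaussian codebooks; for compression,
# the chosen indices form the bit-stream.
import numpy as np

def ddcm_compress(target, denoise_step, codebooks):
    """codebooks: list over timesteps of (K, *latent_shape) fixed noise sets.
    Returns the index stream (log2(K) bits per step) and the reconstruction."""
    x = codebooks[0][0]                      # fixed, shared starting noise
    indices = []
    for t, book in enumerate(codebooks[1:], start=1):
        x0_pred = denoise_step(x, t)
        residual = target - x0_pred          # direction we still need to move
        # Pick the codebook noise most aligned with the residual.
        scores = [np.vdot(z, residual) for z in book]
        k = int(np.argmax(scores))
        indices.append(k)
        x = 0.5 * x0_pred + 0.5 * book[k]    # toy update mixing signal and noise
    return indices, x

rng = np.random.default_rng(0)
shape, K, steps = (8, 8), 16, 10
books = [rng.standard_normal((K, *shape)) for _ in range(steps)]
idx, recon = ddcm_compress(rng.standard_normal(shape),
                           lambda x, t: 0.8 * x, books)
print(len(idx), "indices,", len(idx) * 4, "bits")  # log2(16) = 4 bits/step
```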
https://arxiv.org/abs/2502.01189
Seismic data often face challenges in their utilization due to noise contamination, incomplete acquisition, and limited low-frequency information, which hinder accurate subsurface imaging and interpretation. Traditional processing methods rely heavily on task-specific designs to address these challenges and fail to account for the variability of data. To address these limitations, we present a generative seismic foundation model (GSFM), a unified framework based on generative diffusion models (GDMs), designed to tackle multi-task seismic processing challenges, including denoising, backscattered noise attenuation, interpolation, and low-frequency extrapolation. GSFM leverages a pre-training stage on synthetic data to capture the features of clean, complete, and broadband seismic data distributions and applies an iterative fine-tuning strategy to adapt the model to field data. By adopting a target-oriented diffusion process prediction, GSFM improves computational efficiency without compromising accuracy. Synthetic data tests demonstrate GSFM surpasses benchmarks with equivalent architectures in all tasks and achieves performance comparable to traditional pre-training strategies, even after their fine-tuning. Also, field data tests suggest that our iterative fine-tuning approach addresses the generalization limitations of conventional pre-training and fine-tuning paradigms, delivering significantly enhanced performance across diverse tasks. Furthermore, GSFM's inherent probabilistic nature enables effective uncertainty quantification, offering valuable insights into the reliability of processing results.
https://arxiv.org/abs/2502.01111
Aligning diffusion models to downstream tasks often requires finetuning new models or gradient-based guidance at inference time to enable sampling from the reward-tilted posterior. In this work, we explore a simple inference-time gradient-free guidance approach, called controlled denoising (CoDe), that circumvents the need for differentiable guidance functions and model finetuning. CoDe is a blockwise sampling method applied during intermediate denoising steps, allowing for alignment with downstream rewards. Our experiments demonstrate that, despite its simplicity, CoDe offers a favorable trade-off between reward alignment, prompt instruction following, and inference cost, achieving a competitive performance against the state-of-the-art baselines. Our code is available at: this https URL.
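Because CoDe is gradient-free and blockwise, it reduces to a best-of-N selection applied every few denoising steps. The sketch below makes that loop explicit; the denoiser, reward, and block size are placeholder assumptions, not the paper's configuration:

```python
# Gradient-free blockwise guidance in the spirit of CoDe: every `block` steps,
# roll out several candidate continuations and keep the highest-reward one.
import numpy as np

def blockwise_guided_sampling(denoise_block, reward, T=40, block=8, n_cand=4,
                              shape=(16,), seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    for t0 in range(T, 0, -block):
        # Each candidate runs the same block of steps with different noise.
        cands = [denoise_block(x, t0, block, rng) for _ in range(n_cand)]
        x = max(cands, key=reward)            # no gradients through the reward
    return x

def dummy_denoise_block(x, t0, block, rng):
    return 0.9 * x + 0.1 * rng.standard_normal(x.shape)

sample = blockwise_guided_sampling(dummy_denoise_block,
                                   reward=lambda x: -np.abs(x.mean()))
print(sample.shape)  # (16,)
```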
https://arxiv.org/abs/2502.00968
As digital technologies advance, communication networks face challenges in handling the vast data generated by intelligent devices. Autonomous vehicles, smart sensors, and IoT systems necessitate new paradigms. This thesis addresses these challenges by integrating semantic communication and generative models for optimized image compression and edge network resource allocation. Unlike bit-centric systems, semantic communication prioritizes transmitting meaningful data specifically selected to convey the meaning, rather than obtaining a faithful representation of the original data. The communication infrastructure can thereby benefit from significant improvements in bandwidth efficiency and latency reduction. Central to this work is the design of semantic-preserving image compression using Generative Adversarial Networks and Denoising Diffusion Probabilistic Models. These models compress images by encoding only semantically relevant features, allowing for high-quality reconstruction with minimal transmission. Additionally, a goal-oriented edge network optimization framework is introduced, leveraging the Information Bottleneck principle and stochastic optimization to dynamically allocate resources and enhance efficiency. By integrating semantic communication into edge networks, this approach balances computational efficiency and communication effectiveness, making it suitable for real-time applications. The thesis compares semantic-aware models with conventional image compression techniques using classical and semantic evaluation metrics. Results demonstrate the potential of combining generative AI and semantic communication to create more efficient semantic-goal-oriented communication networks that meet the demands of modern data-driven applications.
https://arxiv.org/abs/2502.01675
The goal of diffusion generative models is to align the learned distribution with the real data distribution through gradient score matching. However, inherent limitations in training data quality, modeling strategies, and architectural design lead to an inevitable gap between generated outputs and real data. To reduce this gap, we propose Weak-to-Strong Diffusion (W2SD), a novel framework that utilizes the estimated difference between existing weak and strong models (i.e., the weak-to-strong difference) to approximate the gap between an ideal model and a strong model. By employing a reflective operation that alternates between denoising and inversion with the weak-to-strong difference, we show theoretically that W2SD steers latent variables along sampling trajectories toward regions of the real data distribution. W2SD is highly flexible and broadly applicable, enabling diverse improvements through the strategic selection of weak-to-strong model pairs (e.g., DreamShaper vs. SD1.5, good experts vs. bad experts in MoE). Extensive experiments demonstrate that W2SD significantly improves human preference, aesthetic quality, and prompt adherence, achieving SOTA performance across various modalities (e.g., image, video), architectures (e.g., UNet-based, DiT-based, MoE), and benchmarks. For example, Juggernaut-XL with W2SD achieves an HPSv2 winning rate of up to 90% over the original results. Moreover, the performance gains achieved by W2SD markedly outweigh its additional computational overhead, while the cumulative improvements from different weak-to-strong differences further solidify its practical utility and deployability.
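A first-order sketch of the reflective operation: one denoising move with the strong model followed by an inversion move with the weak model, so the net drift per step is driven by the weak-to-strong difference. The step rule and models below are simplified placeholders:

```python
# Sketch of W2SD's reflective step. Models are placeholder callables mapping
# (x, t) -> predicted noise; the step rule is a first-order simplification.
import numpy as np

def w2sd_step(x, t, eps_strong, eps_weak, step_size=0.1):
    x_denoised = x - step_size * eps_strong(x, t)    # forward: strong denoising
    x_reflected = x_denoised + step_size * eps_weak(x_denoised, t)  # inverse: weak
    # Net effect ~ x - step_size * (eps_strong - eps_weak) to first order.
    return x_reflected

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))
strong = lambda x, t: 0.5 * x        # stand-in for e.g. DreamShaper
weak = lambda x, t: 0.3 * x          # stand-in for e.g. SD1.5
for t in range(10, 0, -1):
    x = w2sd_step(x, t, strong, weak)
print(np.linalg.norm(x))
```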
https://arxiv.org/abs/2502.00473
Diffusion models have revolutionized generative tasks, especially in the domain of text-to-image synthesis; however, their iterative denoising process demands substantial computational resources. In this paper, we present a novel acceleration strategy that integrates token-level pruning with caching techniques to tackle this computational challenge. By employing the relative magnitude of noise, we identify significant token changes across denoising iterations. Additionally, we enhance token selection by incorporating spatial clustering and ensuring distributional balance. Our experiments reveal a 50%-60% reduction in computational costs while preserving the performance of the model, thereby markedly increasing the efficiency of diffusion models. The code is available at this https URL
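A hedged sketch of the token-selection step: rank tokens by the relative change of their noise predictions between consecutive steps and recompute only the top fraction, serving the rest from cache. The spatial-clustering and distribution-balancing refinements from the abstract are omitted, and the keep ratio is an assumption:

```python
# Token-level pruning with caching: tokens with small relative noise change
# reuse their cached output instead of being recomputed.
import torch

def select_active_tokens(noise_prev, noise_curr, keep_ratio=0.45):
    """noise_*: (N, C) per-token noise predictions at consecutive steps.
    Returns indices of tokens to recompute; the rest are served from cache."""
    rel = (noise_curr - noise_prev).norm(dim=-1) / (noise_prev.norm(dim=-1) + 1e-8)
    k = max(1, int(keep_ratio * rel.numel()))
    return rel.topk(k).indices                 # largest relative change first

N, C = 1024, 64
prev, curr = torch.randn(N, C), torch.randn(N, C)
active = select_active_tokens(prev, curr)
cache = curr.clone()
# Recompute only the active tokens (placeholder for the transformer block);
# everything else keeps its cached value from the previous step.
cache[active] = curr[active] * 1.0
print(f"recomputing {active.numel()}/{N} tokens")
```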
https://arxiv.org/abs/2502.00433
Federated learning (FL) has gained widespread attention for its privacy-preserving and collaborative learning capabilities. Due to significant statistical heterogeneity, traditional FL struggles to generalize a shared model across diverse data domains. Personalized federated learning addresses this issue by dividing the model into a globally shared part and a locally private part, with the local model correcting representation biases introduced by the global model. Nevertheless, locally converged parameters capture domain-specific knowledge more accurately, and current methods overlook the potential benefits of these parameters. To address these limitations, we propose the PM-MoE architecture. This architecture integrates a mixture of personalized modules with energy-based denoising of personalized modules, enabling each client to select beneficial personalized parameters from other clients. We applied the PM-MoE architecture to nine recent model-split-based personalized federated learning algorithms, achieving performance improvements with minimal additional training. Extensive experiments on six widely adopted datasets and two heterogeneity settings validate the effectiveness of our approach. The source code is available at \url{this https URL}.
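The mixture itself can be sketched as a learned gate over frozen personalized heads collected from other clients. The energy-based denoising step is omitted, and the module shapes and gating form below are assumptions rather than the paper's design:

```python
# Sketch of a mixture of personalized modules: a client gates over its peers'
# converged personalized heads; only the gate is trained locally.
import torch
import torch.nn as nn

class PersonalizedMoE(nn.Module):
    def __init__(self, feat_dim, n_classes, peer_heads):
        super().__init__()
        self.heads = nn.ModuleList(peer_heads)          # other clients' heads
        for h in self.heads:
            h.requires_grad_(False)                     # keep peers frozen
        self.gate = nn.Linear(feat_dim, len(peer_heads))

    def forward(self, feats):
        w = self.gate(feats).softmax(dim=-1)            # (B, n_modules)
        outs = torch.stack([h(feats) for h in self.heads], dim=1)  # (B, M, K)
        return (w.unsqueeze(-1) * outs).sum(dim=1)

peers = [nn.Linear(32, 10) for _ in range(5)]           # placeholder heads
moe = PersonalizedMoE(32, 10, peers)
print(moe(torch.randn(4, 32)).shape)  # torch.Size([4, 10])
```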
https://arxiv.org/abs/2502.00354
In the field of human-computer interaction and psychological assessment, speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals. Despite advancements, challenges persist due to system complexity, feature distinctiveness issues, and noise interference. This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER, addressing these limitations by extracting meaningful representations directly from raw waveform speech signals. By leveraging the properties of the fast discrete wavelet transform (FDWT), including the cascade algorithm, conjugate quadrature filters, and coefficient denoising, our approach introduces a learnable model for both the wavelet bases and the denoising through deep learning techniques. The framework incorporates an activation function for learnable asymmetric hard thresholding of wavelet coefficients. Our approach exploits the capabilities of wavelets for effective localization in both the time and frequency domains. We then combine one-dimensional dilated convolutional neural networks (1D dilated CNN) with a spatial attention layer and bidirectional gated recurrent units (Bi-GRU) with a temporal attention layer to efficiently capture the nuanced spatial and temporal characteristics of emotional features. By handling variable-length speech without segmentation and eliminating the need for pre- or post-processing, the proposed model outperforms state-of-the-art methods on the IEMOCAP and EMO-DB datasets. The source code of this paper is shared on the GitHub repository: this https URL.
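The learnable asymmetric hard thresholding can be sketched as an activation with separate trainable thresholds for positive and negative coefficients, relaxed with a steep sigmoid so the thresholds stay differentiable. The exact form in the paper may differ; this only shows the general mechanism:

```python
# Sketch of a learnable asymmetric hard-thresholding activation for wavelet
# coefficients: small coefficients are suppressed (denoised), large ones kept.
import torch
import torch.nn as nn

class AsymmetricHardThreshold(nn.Module):
    def __init__(self, init_pos=0.1, init_neg=0.1, sharpness=50.0):
        super().__init__()
        self.t_pos = nn.Parameter(torch.tensor(init_pos))
        self.t_neg = nn.Parameter(torch.tensor(init_neg))
        self.sharpness = sharpness

    def forward(self, x):
        # Smooth gates ~1 where coefficients exceed their side's threshold.
        gate_pos = torch.sigmoid(self.sharpness * (x - self.t_pos))
        gate_neg = torch.sigmoid(self.sharpness * (-x - self.t_neg))
        return x * (gate_pos + gate_neg)

coeffs = torch.randn(2, 1, 256)            # batch of wavelet coefficients
denoised = AsymmetricHardThreshold()(coeffs)
print(denoised.shape)  # torch.Size([2, 1, 256])
```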
https://arxiv.org/abs/2502.00310
Applying diffusion models to image-to-image translation (I2I) has recently received increasing attention due to its practical applications. Previous attempts inject information from the source image into each denoising step for an iterative refinement, thus resulting in a time-consuming implementation. We propose an efficient method that equips a diffusion model with a lightweight translator, dubbed a Diffusion Model Translator (DMT), to accomplish I2I. Specifically, we first offer theoretical justification that in employing the pioneering DDPM work for the I2I task, it is both feasible and sufficient to transfer the distribution from one domain to another only at some intermediate step. We further observe that the translation performance highly depends on the chosen timestep for domain transfer, and therefore propose a practical strategy to automatically select an appropriate timestep for a given task. We evaluate our approach on a range of I2I applications, including image stylization, image colorization, segmentation to image, and sketch to image, to validate its efficacy and general utility. The comparisons show that our DMT surpasses existing methods in both quality and efficiency. Code will be made publicly available.
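The mechanism reduces to applying a lightweight translator at a single intermediate timestep t*, then continuing denoising in the target domain. The sketch below makes that explicit with placeholder components; the paper's contribution includes how t* is selected automatically, which is not modeled here:

```python
# Sketch of the DMT idea: forward-noise the source image to one intermediate
# timestep, translate the noisy latent across domains once, then denoise.
import numpy as np

def translate_with_dmt(x_src, translator, denoise_to_zero, alpha_bar,
                       t_star=400, seed=0):
    rng = np.random.default_rng(seed)
    a = alpha_bar[t_star]
    x_t = np.sqrt(a) * x_src + np.sqrt(1 - a) * rng.standard_normal(x_src.shape)
    x_t_target = translator(x_t)             # domain transfer at one step only
    return denoise_to_zero(x_t_target, t_star)

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
toy_translator = lambda x: x + 0.1           # placeholder lightweight translator
toy_denoiser = lambda x, t: x / np.sqrt(alpha_bar[t])  # crude x0 estimate
out = translate_with_dmt(np.zeros((3, 16, 16)) + 0.5, toy_translator,
                         toy_denoiser, alpha_bar)
print(out.shape)  # (3, 16, 16)
```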
https://arxiv.org/abs/2502.00307
Image denoising of low-dose computed tomography (LDCT) is an important problem for clinical diagnosis with reduced radiation exposure. Previous methods are mostly trained with pairs of synthetic or misaligned LDCT and normal-dose CT (NDCT) images. However, when trained with synthetic noise or misaligned LDCT/NDCT image pairs, denoising networks suffer from blurry structures or motion artifacts. Since non-contrast CT (NCCT) images share content characteristics with the corresponding NDCT images in a three-phase scan, they can potentially provide useful information for real-world LDCT image denoising. To exploit this aspect, in this paper, we propose to incorporate clean NCCT images as useful guidance for the learning of real-world LDCT image denoising networks. To alleviate the issue of spatial misalignment in training data, we design a new Patch Triplet Similarity Purification (PTSP) strategy to select highly similar patch (instead of image) triplets of LDCT, NDCT, and NCCT images for network training. Furthermore, we modify two image denoising transformers, SwinIR and HAT, to accommodate the NCCT image guidance, by replacing vanilla self-attention with cross-attention. On our collected clinical dataset, the modified transformers trained with the data selected by our PTSP strategy show better performance than 15 comparison methods on real-world LDCT image denoising. Ablation studies validate the effectiveness of our NCCT image guidance and PTSP strategy. We will publicly release our data and code.
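A minimal sketch of the PTSP selection: extract co-located patches from the LDCT/NDCT/NCCT images and keep only triplets whose pairwise similarity clears a threshold. The similarity metric (normalized cross-correlation), patch size, and threshold are illustrative assumptions:

```python
# Sketch of patch-triplet purification: train only on LDCT/NDCT/NCCT patch
# triplets that are highly similar, to sidestep spatial misalignment.
import numpy as np

def extract_patches(img, size=32, stride=32):
    H, W = img.shape
    return {(i, j): img[i:i + size, j:j + size]
            for i in range(0, H - size + 1, stride)
            for j in range(0, W - size + 1, stride)}

def ncc(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def purify_triplets(ldct, ndct, ncct, thresh=0.9):
    p_l, p_n, p_c = (extract_patches(x) for x in (ldct, ndct, ncct))
    return [(p_l[k], p_n[k], p_c[k]) for k in p_l
            if min(ncc(p_l[k], p_n[k]), ncc(p_n[k], p_c[k])) > thresh]

rng = np.random.default_rng(0)
base = rng.standard_normal((128, 128))
triplets = purify_triplets(base + 0.05 * rng.standard_normal((128, 128)),
                           base, base + 0.05 * rng.standard_normal((128, 128)))
print(len(triplets), "of", (128 // 32) ** 2, "patch triplets kept")
```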
https://arxiv.org/abs/2502.00253
Multivariate Time Series Imputation (MTSI) is crucial for many applications, such as healthcare monitoring and traffic management, where incomplete data can compromise decision-making. Existing state-of-the-art methods, like Denoising Diffusion Probabilistic Models (DDPMs), achieve high imputation accuracy; however, they suffer from significant computational costs and are notably time-consuming due to their iterative nature. In this work, we propose CoSTI, an innovative adaptation of Consistency Models (CMs) for the MTSI domain. CoSTI employs Consistency Training to achieve comparable imputation quality to DDPMs while drastically reducing inference times, making it more suitable for real-time applications. We evaluate CoSTI across multiple datasets and missing data scenarios, demonstrating up to a 98% reduction in imputation time with performance on par with diffusion-based models. This work bridges the gap between efficiency and accuracy in generative imputation tasks, providing a scalable solution for handling missing data in critical spatio-temporal systems.
https://arxiv.org/abs/2501.19364
The remarkable progress in text-to-video diffusion models enables photorealistic generations, although the contents of the generated videos often include unnatural movement or deformation, reverse playback, and motionless scenes. Recently, the alignment problem has attracted huge attention, where we steer the output of diffusion models based on some quantity measuring the goodness of the content. Because there is large room for improving perceptual quality along the frame direction, we should address which metrics to optimize and how to optimize them in video generation. In this paper, we propose diffusion latent beam search with a lookahead estimator, which can select a better diffusion latent to maximize a given alignment reward at inference time. We then point out that improving perceptual video quality while considering alignment to prompts requires reward calibration by weighting existing metrics. When evaluating outputs using vision-language models as a proxy for humans, many previous metrics for quantifying the naturalness of video do not always correlate with the evaluation and also depend on the degree of dynamic description in the evaluation prompts. We demonstrate that our method improves perceptual quality based on the calibrated reward, without any model parameter update, and outputs the best generation compared to greedy search and best-of-N sampling. We provide practical guidelines on how to allocate inference-time computation among the search budget, the lookahead steps for reward estimation, and the denoising steps in the reverse diffusion process.
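The proposed search can be sketched as a standard beam search over latents in which each candidate is scored by rolling a cheap lookahead estimate forward to a predicted clean sample before applying the (calibrated) reward. All components below are placeholders for the paper's models:

```python
# Sketch of diffusion latent beam search with a lookahead estimator: keep a
# beam of latents, expand each with several noise draws per step, and score
# candidates on a lookahead estimate of the final sample.
import numpy as np

def beam_search_latents(step_fn, lookahead_fn, reward, T=20, beam=2, expand=3,
                        shape=(8,), seed=0):
    rng = np.random.default_rng(seed)
    beams = [rng.standard_normal(shape) for _ in range(beam)]
    for t in range(T, 0, -1):
        cands = [step_fn(x, t, rng) for x in beams for _ in range(expand)]
        # Lookahead: cheaply estimate the final sample before scoring.
        scored = sorted(cands, key=lambda x: reward(lookahead_fn(x, t)),
                        reverse=True)
        beams = scored[:beam]
    return beams[0]

step = lambda x, t, rng: 0.95 * x + 0.05 * rng.standard_normal(x.shape)
lookahead = lambda x, t: x                    # placeholder final-sample estimate
best = beam_search_latents(step, lookahead, reward=lambda x: -np.abs(x).sum())
print(best.shape)  # (8,)
```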
https://arxiv.org/abs/2501.19252
With the rapid development of wireless communication technology, the efficient utilization of spectrum resources, optimization of communication quality, and intelligent communication have become critical. Radio map reconstruction is essential for enabling advanced applications, yet challenges such as complex signal propagation and sparse data hinder accurate reconstruction. To address these issues, we propose the **Radio Map Diffusion Model (RMDM)**, a physics-informed framework that integrates **Physics-Informed Neural Networks (PINNs)** to incorporate constraints like the **Helmholtz equation**. RMDM employs a dual U-Net architecture: the first ensures physical consistency by minimizing PDE residuals, boundary conditions, and source constraints, while the second refines predictions via diffusion-based denoising. By leveraging physical laws, RMDM significantly enhances accuracy, robustness, and generalization. Experiments demonstrate that RMDM outperforms state-of-the-art methods, achieving **NMSE of 0.0031** and **RMSE of 0.0125** under the Static RM (SRM) setting, and **NMSE of 0.0047** and **RMSE of 0.0146** under the Dynamic RM (DRM) setting. These results establish a novel paradigm for integrating physics-informed and data-driven approaches in radio map reconstruction, particularly under sparse data conditions.
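The physics-informed term can be made concrete as a discrete Helmholtz residual |∇²u + k²u|² computed with a finite-difference Laplacian over the predicted radio map. The wavenumber, grid spacing, and absence of a source term below are assumptions for illustration:

```python
# Sketch of the physics-informed loss: mean squared Helmholtz residual on
# interior pixels, using a 5-point Laplacian stencil.
import torch

def helmholtz_residual(u, k=1.0, h=1.0):
    """u: (B, 1, H, W) predicted field; returns the mean squared PDE
    residual of (laplacian(u) + k^2 * u) on interior pixels."""
    lap = (u[:, :, 1:-1, 2:] + u[:, :, 1:-1, :-2] +
           u[:, :, 2:, 1:-1] + u[:, :, :-2, 1:-1] -
           4.0 * u[:, :, 1:-1, 1:-1]) / h**2
    res = lap + (k ** 2) * u[:, :, 1:-1, 1:-1]
    return (res ** 2).mean()

u_pred = torch.randn(2, 1, 64, 64, requires_grad=True)
loss_pde = helmholtz_residual(u_pred, k=0.5)
loss_pde.backward()                  # gradients flow to the generating network
print(float(loss_pde))
```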
https://arxiv.org/abs/2501.19160
Event-guided imaging has received significant attention due to its potential to revolutionize instant imaging systems. However, prior methods primarily focus on enhancing RGB images in a post-processing manner, neglecting the challenges that an image signal processor (ISP) faces when dealing with event sensors and the benefits events provide for reforming the ISP process. To achieve this, we conduct the first research on event-guided ISP. First, we present a new event-RAW paired dataset, collected with a novel but still confidential sensor that records pixel-level aligned events and RAW images. This dataset includes 3373 RAW images with 2248 x 3264 resolution and their corresponding events, spanning 24 scenes with 3 exposure modes and 3 lenses. Second, we propose a conventional ISP pipeline to generate good RGB frames as reference. This conventional ISP pipeline performs basic ISP operations such as white balancing, denoising, and color space transformation, with a ColorChecker as reference. Third, we classify the existing learnable ISP methods into 3 classes, and select multiple methods to train and evaluate on our new dataset. Lastly, since there is no prior work for reference, we propose a simple event-guided ISP method and test it on our dataset. We further put forward key technical challenges and future directions in RGB-Event ISP. In summary, to the best of our knowledge, this is the very first research focusing on event-guided ISP, and we hope it will inspire the community. The code and dataset are available at: this https URL.
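For orientation, a minimal version of such a conventional pipeline might look as follows; the white-balance gains, the 3x3 color matrix, and the box-filter denoiser are illustrative placeholders, not the paper's ColorChecker-calibrated reference pipeline:

```python
# Minimal ISP sketch on a demosaiced RAW frame: white balance -> denoise ->
# color space transform -> display gamma.
import numpy as np

def simple_isp(raw_rgb, wb_gains=(2.0, 1.0, 1.6), ccm=None, gamma=2.2):
    img = raw_rgb * np.asarray(wb_gains)             # 1) white balance
    # 2) denoise: a 3x3 box filter stands in for a real denoiser.
    k = 3
    pad = np.pad(img, ((k // 2,) * 2, (k // 2,) * 2, (0, 0)), mode="edge")
    img = np.mean([pad[i:i + img.shape[0], j:j + img.shape[1]]
                   for i in range(k) for j in range(k)], axis=0)
    # 3) color space transform with a placeholder ColorChecker-fit matrix.
    if ccm is None:
        ccm = np.array([[1.6, -0.4, -0.2],
                        [-0.3, 1.5, -0.2],
                        [-0.1, -0.5, 1.6]])
    img = np.clip(img @ ccm.T, 0.0, 1.0)
    return img ** (1.0 / gamma)                      # 4) display gamma

raw = np.random.default_rng(0).uniform(0, 0.5, (64, 64, 3))
print(simple_isp(raw).shape)  # (64, 64, 3)
```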
https://arxiv.org/abs/2501.19129