Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen steps. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and that, given the complex nature of images, combinations of the components in the framework can be chosen to suit different application scenarios.
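To make the search formulation concrete, here is a minimal best-of-N sketch in PyTorch (not the paper's full framework): draw several candidate starting noises, run the same sampler on each, and keep the sample the verifier scores highest. `sampler` and `verifier` are hypothetical placeholders for a pretrained diffusion sampling loop and a feedback model such as a classifier- or CLIP-based scorer.

```python
import torch

def search_over_noises(sampler, verifier, shape, n_candidates=8):
    """Best-of-N search over initial noises for a diffusion sampler.

    sampler:  callable mapping an initial noise tensor to a decoded sample.
    verifier: callable mapping a sample to a scalar score (higher is better).
    """
    best_sample, best_score = None, float("-inf")
    for _ in range(n_candidates):
        z_T = torch.randn(shape)              # candidate starting noise
        sample = sampler(z_T)                 # full denoising run from this noise
        score = float(verifier(sample))       # feedback from the verifier
        if score > best_score:
            best_sample, best_score = sample, score
    return best_sample, best_score

if __name__ == "__main__":
    # Stand-ins: a pretend "sampler" and a pretend quality score.
    sampler = lambda z: z.tanh()
    verifier = lambda x: -x.pow(2).mean()
    sample, score = search_over_noises(sampler, verifier, shape=(1, 3, 32, 32))
    print(score)
```

More elaborate algorithms from the design space described above would replace the outer loop (e.g., searching over noise candidates at intermediate denoising steps) while keeping the same verifier interface.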
https://arxiv.org/abs/2501.09732
This tutorial provides an in-depth guide on inference-time guidance and alignment methods for optimizing downstream reward functions in diffusion models. While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability, affinity in proteins, closeness to target structures). In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine-tuning. This tutorial explores the foundational aspects of such inference-time algorithms. We review these methods from a unified perspective, demonstrating that current techniques -- such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling, and classifier guidance -- aim to approximate soft optimal denoising processes (a.k.a. policies in RL) that combine pre-trained denoising processes with value functions serving as look-ahead predictors of terminal rewards from intermediate states. Within this framework, we present several novel algorithms not yet covered in the literature. Furthermore, we discuss (1) fine-tuning methods combined with inference-time techniques, (2) inference-time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference-time algorithms in language models and diffusion models. The code of this tutorial on protein design is available at this https URL
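As a concrete instance of the SMC-style guidance the tutorial unifies, the sketch below (an assumption-laden simplification, not code from the tutorial's repository) propagates a population of partially denoised particles and resamples them in proportion to an exponentiated value estimate. `denoise_step` and `value_fn` are hypothetical stand-ins for a pretrained reverse-diffusion step and a look-ahead value function.

```python
import torch

def smc_guided_sampling(denoise_step, value_fn, shape, num_steps=50,
                        num_particles=16, temperature=1.0):
    """SMC-style guidance: propagate particles with the pretrained denoiser and
    resample them with weights proportional to exp(value / temperature), so
    high-value trajectories survive.

    denoise_step(x, t) -> x at step t-1 (one reverse step of a pretrained model)
    value_fn(x, t)     -> (num_particles,) look-ahead estimate of terminal reward
    """
    x = torch.randn(num_particles, *shape)
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)                    # propose with the prior policy
        logw = value_fn(x, t) / temperature       # importance weights from values
        probs = torch.softmax(logw, dim=0)
        idx = torch.multinomial(probs, num_particles, replacement=True)
        x = x[idx]                                # resample particles
    return x

if __name__ == "__main__":
    # Stand-ins: a "denoiser" that shrinks noise and a value preferring small norm.
    step = lambda x, t: 0.95 * x
    value = lambda x, t: -x.flatten(1).norm(dim=1)
    samples = smc_guided_sampling(step, value, shape=(8,), num_steps=20)
    print(samples.shape)
```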
https://arxiv.org/abs/2501.09685
Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity (manifested in elevated FLOPs and parameter counts) limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL divergence loss with a contrastive learning loss at the image level. Experiments on three tasks (image deraining, deblurring, and denoising) demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.
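The two loss ingredients named above can be sketched compactly; the MCA mechanism itself and the loss weighting are not reproduced, and the functions below are illustrative PyTorch implementations under common definitions (RBF kernel-space distance, InfoNCE-style image-level contrast).

```python
import torch
import torch.nn.functional as F

def gaussian_kernel_distance(f_student, f_teacher, sigma=1.0):
    """Distance between feature maps in a Gaussian (RBF) kernel space:
    k(x, x) + k(y, y) - 2 k(x, y) = 2 - 2 * exp(-||x - y||^2 / (2 sigma^2))."""
    sq = (f_student - f_teacher).flatten(1).pow(2).sum(dim=1)
    return (2.0 - 2.0 * torch.exp(-sq / (2.0 * sigma ** 2))).mean()

def image_contrastive_loss(student_out, teacher_out, temperature=0.1):
    """Image-level InfoNCE: each student restoration should match its own
    teacher reconstruction (positive) against the other images in the batch."""
    s = F.normalize(student_out.flatten(1), dim=1)
    t = F.normalize(teacher_out.flatten(1), dim=1)
    logits = s @ t.T / temperature                       # (B, B) similarity matrix
    labels = torch.arange(s.size(0), device=s.device)    # positives on the diagonal
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    fs, ft = torch.randn(4, 64, 16, 16), torch.randn(4, 64, 16, 16)
    ys, yt = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
    print(gaussian_kernel_distance(fs, ft).item(),
          image_contrastive_loss(ys, yt).item())
```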
https://arxiv.org/abs/2501.09321
We propose a new continuous video modeling framework based on implicit neural representations (INRs) called ActINR. At the core of our approach is the observation that INRs can be considered as a learnable dictionary, with the shapes of the basis functions governed by the weights of the INR, and their locations governed by the biases. Given compact non-linear activation functions, we hypothesize that an INR's biases are suitable to capture motion across images, and facilitate compact representations for video sequences. Using these observations, we design ActINR to share INR weights across frames of a video sequence, while using unique biases for each frame. We further model the biases as the output of a separate INR conditioned on time index to promote smoothness. By training the video INR and this bias INR together, we demonstrate unique capabilities, including $10\times$ video slow motion, $4\times$ spatial super resolution along with $2\times$ slow motion, denoising, and video inpainting. ActINR performs remarkably well across numerous video processing tasks (often achieving more than 6dB improvement), setting a new standard for continuous modeling of videos.
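A minimal sketch of the weight-sharing idea, assuming a SIREN-style MLP: the main network's weights are shared across all frames, while its per-layer biases come from a small time-conditioned MLP (`bias_inr`). Layer sizes and the two-layer bias network are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ActINRSketch(nn.Module):
    """Shared-weight video INR: the main MLP's weights are shared across frames,
    while its per-layer biases are produced by a small time-conditioned MLP."""
    def __init__(self, hidden=64, layers=3):
        super().__init__()
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(hidden, 2 if i == 0 else hidden) * 0.1)
             for i in range(layers)])
        self.head = nn.Linear(hidden, 3)                       # RGB output
        self.bias_inr = nn.Sequential(                         # time -> all biases
            nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, layers * hidden))
        self.layers, self.hidden = layers, hidden

    def forward(self, xy, t):
        # xy: (N, 2) pixel coordinates in [-1, 1]; t: scalar frame time in [0, 1]
        biases = self.bias_inr(t.view(1, 1)).view(self.layers, self.hidden)
        h = xy
        for i, W in enumerate(self.weights):
            h = torch.sin(h @ W.T + biases[i])                 # sine activation (SIREN-style)
        return self.head(h)

if __name__ == "__main__":
    model = ActINRSketch()
    coords = torch.rand(1024, 2) * 2 - 1
    rgb = model(coords, torch.tensor(0.25))
    print(rgb.shape)   # (1024, 3)
```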
https://arxiv.org/abs/2501.09277
The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
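For context, the FIFO mechanism that Ouroboros-Diffusion builds on can be sketched as follows (the SACFA and self-recurrent guidance components are not modelled): frames in the queue carry progressively increasing noise levels, each step pushes every frame one level closer to clean, the clean head is dequeued as output, and fresh Gaussian noise is enqueued at the tail. `denoise_one_level` is a hypothetical wrapper around a pretrained video diffusion step.

```python
import torch

def fifo_generate(denoise_one_level, frame_shape, queue_len=16, num_frames=64):
    """FIFO-style long-video sampling sketch.

    denoise_one_level(frames, levels) -> frames, each pushed one noise level
    closer to clean; frames: (Q, C, H, W), levels: (Q,) ints where level 0 is
    clean and level Q-1 is pure noise.
    """
    levels = torch.arange(queue_len)                    # head cleanest, tail noisiest
    queue = torch.randn(queue_len, *frame_shape)
    outputs = []
    for _ in range(num_frames):
        queue = denoise_one_level(queue, levels)        # one partial denoising step per frame
        outputs.append(queue[0].clone())                # head is now fully denoised: dequeue
        queue = torch.cat([queue[1:], torch.randn(1, *frame_shape)], dim=0)  # enqueue noise
    return torch.stack(outputs)

if __name__ == "__main__":
    # Stand-in "denoiser": shrink each frame proportionally to its noise level.
    step = lambda q, lv: q * (1.0 - 0.5 / (lv.view(-1, 1, 1, 1) + 1))
    video = fifo_generate(step, frame_shape=(3, 32, 32), queue_len=8, num_frames=24)
    print(video.shape)   # (24, 3, 32, 32)
```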
https://arxiv.org/abs/2501.09019
This paper introduces the Raw Natural Image Noise Dataset (RawNIND), a diverse collection of paired raw images designed to support the development of denoising models that generalize across sensors, image development workflows, and styles. Two denoising methods are proposed: one operates directly on raw Bayer data for computational efficiency, while the other processes linear RGB images for improved generalization to different sensors; both preserve flexibility for subsequent development. Both methods outperform traditional approaches that rely on developed images. Additionally, the integration of denoising and compression at the raw data level significantly enhances rate-distortion performance and computational efficiency. These findings suggest a paradigm shift toward raw data workflows for efficient and flexible image processing.
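The paper's models are not reproduced here; the snippet below only illustrates the preprocessing step implied by operating directly on raw Bayer data: packing a single-channel mosaic (assumed RGGB) into a 4-channel, half-resolution tensor that a CNN denoiser, and optionally a learned codec, can consume.

```python
import torch

def pack_bayer_rggb(raw):
    """Pack a single-channel Bayer mosaic (assumed RGGB layout) of shape (H, W)
    into a 4-channel half-resolution tensor (4, H/2, W/2): R, G1, G2, B."""
    r  = raw[0::2, 0::2]
    g1 = raw[0::2, 1::2]
    g2 = raw[1::2, 0::2]
    b  = raw[1::2, 1::2]
    return torch.stack([r, g1, g2, b], dim=0)

if __name__ == "__main__":
    mosaic = torch.rand(256, 256)           # stand-in raw capture
    packed = pack_bayer_rggb(mosaic)
    print(packed.shape)                     # torch.Size([4, 128, 128])
    # A raw-domain denoiser would take `packed` (plus, e.g., ISO/gain metadata)
    # as input and return a denoised 4-channel tensor for later development.
```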
https://arxiv.org/abs/2501.08924
This paper introduces a novel approach to enhance the performance of Gaussian Shading, a prevalent watermarking technique, by integrating the Exact Diffusion Inversion via Coupled Transformations (EDICT) framework. While Gaussian Shading traditionally embeds watermarks in a noise latent space, followed by iterative denoising for image generation and noise addition for watermark recovery, its inversion process is not exact, leading to potential watermark distortion. We propose to leverage EDICT's ability to derive exact inverse mappings to refine this process. Our method involves duplicating the watermark-infused noisy latent and employing a reciprocal, alternating denoising and noising scheme between the two latents, facilitated by EDICT. This allows for a more precise reconstruction of both the image and the embedded watermark. Empirical evaluation on standard datasets demonstrates that our integrated approach yields a slight, yet statistically significant improvement in watermark recovery fidelity. These results highlight the potential of EDICT to enhance existing diffusion-based watermarking techniques by providing a more accurate and robust inversion mechanism. To the best of our knowledge, this is the first work to explore the synergy between EDICT and Gaussian Shading for digital watermarking, opening new avenues for research in robust and high-fidelity watermark embedding and extraction.
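The key property borrowed from EDICT is an exactly invertible coupled update on a duplicated latent. The toy below follows that general structure with illustrative coefficients (it is not the EDICT update schedule or the paper's pipeline) and verifies numerically that the inverse recovers the watermarked latent.

```python
import torch

def edict_like_step(x, y, eps_model, t, a=0.98, b=0.05, p=0.93):
    """One coupled, exactly invertible update in the spirit of EDICT: the two
    copies of the latent alternately condition each other's affine update and
    are then mixed. Coefficients a, b and mixing weight p are illustrative."""
    x = a * x + b * eps_model(y, t)
    y = a * y + b * eps_model(x, t)
    x = p * x + (1 - p) * y
    y = p * y + (1 - p) * x
    return x, y

def edict_like_step_inverse(x, y, eps_model, t, a=0.98, b=0.05, p=0.93):
    """Exact algebraic inverse of `edict_like_step` (undo each operation in reverse)."""
    y = (y - (1 - p) * x) / p
    x = (x - (1 - p) * y) / p
    y = (y - b * eps_model(x, t)) / a
    x = (x - b * eps_model(y, t)) / a
    return x, y

if __name__ == "__main__":
    eps = lambda z, t: torch.tanh(z) * 0.1              # stand-in noise predictor
    z_wm = torch.randn(1, 4, 8, 8)                       # watermark-infused latent
    x, y = edict_like_step(z_wm.clone(), z_wm.clone(), eps, t=10)
    x_rec, y_rec = edict_like_step_inverse(x, y, eps, t=10)
    print(torch.allclose(x_rec, z_wm, atol=1e-5),
          torch.allclose(y_rec, z_wm, atol=1e-5))        # True True: exact recovery
```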
https://arxiv.org/abs/2501.08604
Diffusion models have achieved cutting-edge performance in image generation. However, their lengthy denoising process and computationally intensive score estimation network impede their scalability in low-latency and resource-constrained scenarios. Post-training quantization (PTQ) compresses and accelerates diffusion models without retraining, but it inevitably introduces additional quantization noise, resulting in mean and variance deviations. In this work, we propose D2-DPM, a dual denoising mechanism aimed at precisely mitigating the adverse effects of quantization noise on the noise estimation network. Specifically, we first decompose the impact of quantization noise on the sampling equation into two components: the mean deviation and the variance deviation. The mean deviation alters the drift coefficient of the sampling equation, influencing the trajectory trend, while the variance deviation magnifies the diffusion coefficient, impacting the convergence of the sampling trajectory. The proposed D2-DPM is thus devised to denoise the quantization noise at each time step, and then denoise the noisy sample through the inverse diffusion iterations. Experimental results demonstrate that D2-DPM achieves superior generation quality, yielding an FID 1.42 lower than that of the full-precision model while achieving 3.99x compression and 11.67x bit-operation acceleration.
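A heavily simplified illustration of the underlying idea, not the paper's D2-DPM update: calibrate the mean and standard deviation of the quantization error against the full-precision model, then center and shrink the quantized noise prediction before it enters the sampling equation. The linear-MMSE-style shrinkage factor is an assumption made for this sketch.

```python
import torch

def corrected_eps(eps_quant, mu_q, sigma_q):
    """Correct a quantized noise prediction using calibration statistics of the
    quantization error (mean mu_q and std sigma_q, measured against the
    full-precision model). Subtracting the mean removes the drift bias; the
    shrinkage is a linear-MMSE-style compensation for the inflated variance,
    assuming a roughly unit-variance noise prediction."""
    return (eps_quant - mu_q) / (1.0 + sigma_q ** 2)

def calibrate_quant_error(eps_fp_samples, eps_quant_samples):
    """Estimate error statistics from paired full-precision / quantized outputs."""
    err = eps_quant_samples - eps_fp_samples
    return err.mean(dim=0), err.std(dim=0)

if __name__ == "__main__":
    eps_fp = torch.randn(128, 4, 8, 8)
    eps_q = eps_fp + 0.05 + 0.1 * torch.randn_like(eps_fp)   # simulated quantization noise
    mu_q, sigma_q = calibrate_quant_error(eps_fp, eps_q)
    fixed = corrected_eps(eps_q, mu_q, sigma_q)
    print((eps_q - eps_fp).abs().mean().item(), (fixed - eps_fp).abs().mean().item())
```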
https://arxiv.org/abs/2501.08180
The low visibility and high ISO noise of extremely low-light images obscure critical visual details and pose a significant challenge to human pose estimation. Current methods fail to provide high-quality representations because they rely on pixel-level enhancements that compromise semantics and are unable to handle extreme low-light conditions for robust feature learning. In this work, we propose a frequency-based framework for low-light human pose estimation, rooted in the "divide-and-conquer" principle. Instead of uniformly enhancing the entire image, our method focuses on task-relevant information. By applying dynamic illumination correction to the low-frequency components and low-rank denoising to the high-frequency components, we effectively enhance both the semantic and texture information essential for accurate pose estimation. This targeted enhancement yields robust, high-quality representations and significantly improves pose estimation performance. Extensive experiments demonstrate its superiority over state-of-the-art methods in various challenging low-light scenarios.
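A toy version of the divide-and-conquer split, with simple placeholders (gamma brightening and uniform attenuation) standing in for the paper's dynamic illumination correction and low-rank denoising; only the frequency decomposition pattern is meant to carry over.

```python
import torch

def frequency_split(img, radius=0.1):
    """Split an image (C, H, W) into low- and high-frequency components with a
    circular mask in the Fourier domain."""
    C, H, W = img.shape
    fy = torch.fft.fftfreq(H).view(H, 1)
    fx = torch.fft.fftfreq(W).view(1, W)
    low_mask = ((fy ** 2 + fx ** 2).sqrt() < radius).to(img.dtype)
    spec = torch.fft.fft2(img)
    low = torch.fft.ifft2(spec * low_mask).real
    high = img - low
    return low, high

def divide_and_conquer_enhance(img, gamma=0.45, shrink=0.5):
    """Illustrative 'divide and conquer' enhancement: brighten the low-frequency
    component (stand-in for dynamic illumination correction) and attenuate the
    high-frequency component (stand-in for low-rank denoising), then recombine."""
    low, high = frequency_split(img)
    low_corrected = low.clamp(min=0) ** gamma      # simple gamma brightening
    high_denoised = shrink * high                  # simple noise attenuation
    return (low_corrected + high_denoised).clamp(0, 1)

if __name__ == "__main__":
    dark_noisy = (0.05 * torch.rand(3, 64, 64) + 0.02 * torch.randn(3, 64, 64)).clamp(0, 1)
    enhanced = divide_and_conquer_enhance(dark_noisy)
    print(enhanced.mean().item())   # noticeably brighter than the input mean
```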
https://arxiv.org/abs/2501.08038
Optical remote sensing images play a crucial role in the observation of the Earth's surface. However, obtaining complete optical remote sensing images is challenging due to cloud cover. Reconstructing cloud-free optical images has become a major task in recent years. This paper presents a two-flow Polarimetric Synthetic Aperture Radar (PolSAR)-Optical data fusion cloud removal algorithm (PODF-CR), which achieves the reconstruction of missing optical images. PODF-CR consists of an encoding module and a decoding module. The encoding module includes two parallel branches that extract PolSAR image features and optical image features. To address speckle noise in PolSAR images, we introduce dynamic filters in the PolSAR branch for image denoising. To better facilitate the fusion between multimodal optical images and PolSAR images, we propose fusion blocks based on cross-skip connections to enable interaction of multimodal data information. The obtained fusion features are refined through an attention mechanism to provide better conditions for the subsequent decoding of the fused images. In the decoding module, multi-scale convolution is introduced to obtain multi-scale information. Additionally, to better utilize comprehensive scattering information and polarization characteristics to assist in the restoration of optical images, we use a dataset for cloud restoration called OPT-BCFSAR-PFSAR, which includes backscatter coefficient feature images and polarization feature images obtained from PolSAR data and optical images. Experimental results demonstrate that this method outperforms existing methods in both qualitative and quantitative evaluations.
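The paper's exact blocks are not reproduced; the sketch below shows a generic cross-skip fusion pattern of the kind described, where each branch's features are refined with a skip connection from the other branch and the concatenated result is re-weighted by channel attention. All layer shapes are illustrative.

```python
import torch
import torch.nn as nn

class CrossSkipFusionBlock(nn.Module):
    """Illustrative fusion block for a two-branch encoder: each branch's features
    are refined using a skip connection coming from the *other* branch, then the
    two streams are concatenated and re-weighted by channel attention. This is a
    generic cross-skip pattern, not PODF-CR's exact block."""
    def __init__(self, ch=32):
        super().__init__()
        self.sar_conv = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.opt_conv = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.attn = nn.Sequential(                      # squeeze-and-excitation style
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * ch, 2 * ch, 1), nn.Sigmoid())
        self.out = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, f_sar, f_opt):
        f_sar = torch.relu(self.sar_conv(torch.cat([f_sar, f_opt], dim=1))) + f_sar
        f_opt = torch.relu(self.opt_conv(torch.cat([f_opt, f_sar], dim=1))) + f_opt
        fused = torch.cat([f_sar, f_opt], dim=1)
        return self.out(fused * self.attn(fused))

if __name__ == "__main__":
    block = CrossSkipFusionBlock()
    f1, f2 = torch.randn(2, 32, 64, 64), torch.randn(2, 32, 64, 64)
    print(block(f1, f2).shape)   # (2, 32, 64, 64)
```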
https://arxiv.org/abs/2501.07901
Generative AI, powered by large language models (LLMs), has revolutionized applications across text, audio, images, and video. This study focuses on developing and evaluating encoder-decoder architectures for the American Sign Language (ASL) image dataset, consisting of 87,000 images across 29 hand sign classes. Three approaches were compared: Feedforward Autoencoders, Convolutional Autoencoders, and Diffusion Autoencoders. The Diffusion Autoencoder outperformed the others, achieving the lowest mean squared error (MSE) and highest Mean Opinion Score (MOS) due to its probabilistic noise modeling and iterative denoising capabilities. The Convolutional Autoencoder demonstrated effective spatial feature extraction but lacked the robustness of the diffusion process, while the Feedforward Autoencoder served as a baseline with limitations in handling complex image data. Objective and subjective evaluations confirmed the superiority of the Diffusion Autoencoder for high-fidelity image reconstruction, emphasizing its potential in multimodal AI applications such as sign language recognition and generation. This work provides critical insights into designing robust encoder-decoder systems to advance multimodal AI capabilities.
https://arxiv.org/abs/2501.06942
Virtual Try-On (VTON) technology allows users to visualize how clothes would look on them without physically trying them on, gaining traction with the rise of digitalization and online shopping. Traditional VTON methods, often using Generative Adversarial Networks (GANs) and Diffusion models, face challenges in achieving high realism and handling dynamic poses. This paper introduces Outfitting Diffusion with Pose Guided Condition (ODPG), a novel approach that leverages a latent diffusion model with multiple conditioning inputs during the denoising process. By transforming garment, pose, and appearance images into latent features and integrating these features in a UNet-based denoising model, ODPG achieves non-explicit synthesis of garments on dynamically posed human images. Our experiments on the FashionTryOn and a subset of the DeepFashion dataset demonstrate that ODPG generates realistic VTON images with fine-grained texture details across various poses, utilizing an end-to-end architecture without the need for explicit garment warping processes. Future work will focus on generating VTON outputs in video format and on applying our attention mechanism, as detailed in the Method section, to other domains with limited data.
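A toy stand-in for the multi-input conditioning pattern: the noisy person latent, the garment/pose/appearance latents, and a timestep map are concatenated channel-wise and mapped to a noise prediction. The real ODPG denoiser is a UNet with attention operating in a VAE latent space; this sketch only shows how the conditions enter.

```python
import torch
import torch.nn as nn

class MultiConditionDenoiser(nn.Module):
    """Toy conditional denoiser: noisy latent + garment/pose/appearance latents
    + a broadcast timestep channel are concatenated and mapped to a noise
    prediction. Layer sizes are illustrative only."""
    def __init__(self, latent_ch=4, cond_ch=4, hidden=64):
        super().__init__()
        in_ch = latent_ch + 3 * cond_ch + 1            # noisy latent + 3 conditions + timestep
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, latent_ch, 3, padding=1))

    def forward(self, z_noisy, z_garment, z_pose, z_appearance, t):
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *z_noisy.shape[-2:])
        x = torch.cat([z_noisy, z_garment, z_pose, z_appearance, t_map], dim=1)
        return self.net(x)                             # predicted noise

if __name__ == "__main__":
    model = MultiConditionDenoiser()
    z = torch.randn(2, 4, 32, 32)
    cond = lambda: torch.randn(2, 4, 32, 32)
    eps_hat = model(z, cond(), cond(), cond(), t=torch.tensor([0.3, 0.7]))
    print(eps_hat.shape)   # (2, 4, 32, 32)
```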
https://arxiv.org/abs/2501.06769
Wearable electrocardiogram (ECG) measurement using dry electrodes suffers from high-intensity noise distortion, so a robust noise reduction method is required. However, the overlapping frequency bands of the ECG and the noise make noise reduction difficult, and a mechanism is needed that adapts to the characteristics of the noise, which vary with its intensity and type. This study proposes a convolutional neural network (CNN) model with an additional wavelet transform layer that extracts the specific frequency features of a clean ECG. Testing confirms that the proposed method effectively predicts accurate ECG behavior with reduced noise by accounting for all frequency domains. In an experiment, noisy signals with signal-to-noise ratios (SNR) ranging from -10 to 10 are evaluated, demonstrating that the proposed method is more effective when the SNR is small.
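A minimal sketch of the "wavelet layer + CNN" idea, assuming a single-level Haar decomposition implemented as a fixed (non-trainable) layer; the paper's wavelet choice, network depth, and training setup are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarWaveletLayer(nn.Module):
    """Single-level Haar wavelet decomposition of a 1-D signal, implemented as a
    fixed strided convolution. Returns approximation and detail coefficients."""
    def __init__(self):
        super().__init__()
        s = 2 ** -0.5
        kernels = torch.tensor([[[s, s]], [[s, -s]]])      # low-pass, high-pass
        self.register_buffer("kernels", kernels)

    def forward(self, x):                                  # x: (B, 1, L), L even
        return F.conv1d(x, self.kernels, stride=2)         # (B, 2, L/2)

class WaveletECGDenoiser(nn.Module):
    """Illustrative ECG denoiser: a Haar wavelet layer exposes band-separated
    features, a small CNN predicts clean coefficients, and an inverse transform
    (transposed convolution with the same fixed kernels) rebuilds the signal."""
    def __init__(self, hidden=32):
        super().__init__()
        self.dwt = HaarWaveletLayer()
        self.cnn = nn.Sequential(
            nn.Conv1d(2, hidden, 9, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 9, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, 2, 9, padding=4))

    def forward(self, x):
        coeffs = self.dwt(x)
        cleaned = self.cnn(coeffs)
        return F.conv_transpose1d(cleaned, self.dwt.kernels, stride=2)

if __name__ == "__main__":
    noisy = torch.randn(4, 1, 512)                 # stand-in for noisy ECG segments
    print(WaveletECGDenoiser()(noisy).shape)       # (4, 1, 512)
```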
https://arxiv.org/abs/2501.06724
Video restoration plays a pivotal role in revitalizing degraded video content by rectifying imperfections caused by various degradations introduced during capturing (sensor noise, motion blur, etc.), saving/sharing (compression, resizing, etc.) and editing. This paper introduces a novel algorithm designed for scenarios where noise is introduced during video capture, aiming to enhance the visual quality of videos by reducing unwanted noise artifacts. We propose the Latent space LSTM Video Denoiser (LLVD), an end-to-end blind denoising model. LLVD uniquely combines spatial and temporal feature extraction, employing Long Short Term Memory (LSTM) within the encoded feature domain. This integration of LSTM layers is crucial for maintaining continuity and minimizing flicker in the restored video. Moreover, processing frames in the encoded feature domain significantly reduces computations, resulting in a very lightweight architecture. LLVD's blind nature makes it versatile for real, in-the-wild denoising scenarios where prior information about noise characteristics is not available. Experiments reveal that LLVD demonstrates excellent performance for both synthetic and captured noise. Specifically, LLVD surpasses the current State-Of-The-Art (SOTA) in RAW denoising by 0.3 dB, while also achieving a 59% reduction in computational complexity.
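A schematic of the encode, temporal LSTM, decode pattern (layer sizes are illustrative and the sketch assumes 64x64 frames; it is not the LLVD architecture): frames are encoded to a small latent grid, an LSTM runs over time in that feature domain to keep temporal continuity, and a decoder maps the features back to frames.

```python
import torch
import torch.nn as nn

class LatentLSTMVideoDenoiser(nn.Module):
    """Illustrative latent-space LSTM video denoiser (not the paper's layout)."""
    def __init__(self, ch=16, latent_hw=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        feat_dim = ch * latent_hw * latent_hw            # assumes 64x64 input frames
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1))
        self.ch = ch

    def forward(self, video):                      # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1))  # (B*T, ch, h, w)
        h, w = feats.shape[-2:]
        seq, _ = self.lstm(feats.view(B, T, -1))   # temporal modelling in latent space
        out = self.decoder(seq.reshape(B * T, self.ch, h, w))
        return out.view(B, T, 3, *out.shape[-2:])

if __name__ == "__main__":
    clip = torch.randn(1, 8, 3, 64, 64)            # short noisy clip
    print(LatentLSTMVideoDenoiser()(clip).shape)   # (1, 8, 3, 64, 64)
```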
https://arxiv.org/abs/2501.05744
3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly in the realm of RGB-based methods. However, RGB images are susceptible to limitations such as sensitivity to lighting conditions and potential user discomfort. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance and the imperative for continual learning. In this work, we introduce a novel balanced continual multi-modal learning method for 3D HPE, which harnesses the power of RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to quantify the contribution of each modality and identify modality imbalance. To address this imbalance, we employ a re-learning strategy. Furthermore, recognizing that raw data is prone to noise contamination, we develop a novel denoising continual learning approach. This approach incorporates a noise identification and separation module to mitigate the adverse effects of noise and collaborates with the balanced learning strategy to enhance optimization. Additionally, an adaptive EWC mechanism is employed to alleviate catastrophic forgetting. We conduct extensive experiments on the widely-adopted multi-modal dataset, MM-Fi, which demonstrate the superiority of our approach in boosting 3D pose estimation and mitigating catastrophic forgetting in complex scenarios. We will release our code.
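Exact Shapley values are tractable for a handful of modalities, since they only require evaluating a utility function on every subset. The sketch below assumes such a utility (e.g., negative pose error of a model evaluated with that modality subset); the paper's specific estimator and re-learning strategy are not reproduced.

```python
from itertools import combinations
from math import factorial

def shapley_contributions(modalities, utility):
    """Exact Shapley values for a small set of modalities.

    modalities: list of modality names, e.g. ["rgb", "lidar", "mmwave", "wifi"].
    utility:    callable mapping a frozenset of modalities to a performance
                score (e.g. negative pose error on a validation split).
    """
    n = len(modalities)
    phi = {m: 0.0 for m in modalities}
    for m in modalities:
        others = [x for x in modalities if x != m]
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[m] += weight * (utility(s | {m}) - utility(s))
    return phi

if __name__ == "__main__":
    # Toy utility: additive scores plus a small synergy between rgb and lidar.
    base = {"rgb": 0.50, "lidar": 0.30, "mmwave": 0.15, "wifi": 0.05}
    def utility(subset):
        bonus = 0.05 if {"rgb", "lidar"} <= subset else 0.0
        return sum(base[m] for m in subset) + bonus
    print(shapley_contributions(list(base), utility))
```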
https://arxiv.org/abs/2501.05264
Diffusion models have been widely used in the generative domain due to their convincing performance in modeling complex data distributions. Moreover, they have shown competitive results on discriminative tasks, such as image segmentation. While diffusion models have also been explored for automatic music transcription, their performance has yet to reach a competitive level. In this paper, we focus on the refinement capabilities of discrete diffusion models and present a novel architecture for piano transcription. Our model utilizes Neighborhood Attention layers as the denoising module, gradually predicting the target high-resolution piano roll, conditioned on the finetuned features of a pretrained acoustic model. To further enhance refinement, we devise a novel strategy which applies distinct transition states during the training and inference stages of discrete diffusion models. Experiments on the MAESTRO dataset show that our approach outperforms previous diffusion-based piano transcription models and the baseline model in terms of F1 score. Our code is available at this https URL.
https://arxiv.org/abs/2501.05068
Recently, Transformer networks have demonstrated outstanding performance in the field of image restoration due to the global receptive field and adaptability to input. However, the quadratic computational complexity of Softmax-attention poses a significant limitation on its extensive application in image restoration tasks, particularly for high-resolution images. To tackle this challenge, we propose a novel variant of the Transformer. This variant leverages the Taylor expansion to approximate the Softmax-attention and utilizes the concept of norm-preserving mapping to approximate the remainder of the first-order Taylor expansion, resulting in a linear computational complexity. Moreover, we introduce a multi-branch architecture featuring multi-scale patch embedding into the proposed Transformer, which has four distinct advantages: 1) various sizes of the receptive field; 2) multi-level semantic information; 3) flexible shapes of the receptive field; 4) accelerated training and inference speed. Hence, the proposed model, named the second version of the Taylor formula expansion-based Transformer (MB-TaylorFormer V2 for short), has the capability to concurrently process coarse-to-fine features, capture long-distance pixel interactions with limited computational cost, and improve the approximation of the Taylor expansion remainder. Experimental results across diverse image restoration benchmarks demonstrate that MB-TaylorFormer V2 achieves state-of-the-art performance in multiple image restoration tasks, such as image dehazing, deraining, desnowing, motion deblurring, and denoising, with very little computational overhead. The source code is available at this https URL.
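The core trick can be written in a few lines: replacing exp(q·k) with its first-order Taylor expansion 1 + q·k turns softmax attention into an associativity-friendly form with linear cost in sequence length. The norm-preserving correction of the remainder introduced in V2 is not modelled in this sketch.

```python
import torch
import torch.nn.functional as F

def taylor_linear_attention(q, k, v):
    """First-order Taylor approximation of softmax attention.

    With exp(q·k) ≈ 1 + q·k, attention becomes
        out_i = (sum_j v_j + q_i · sum_j k_j v_j^T) / (N + q_i · sum_j k_j),
    which costs O(N d^2) instead of O(N^2 d). q, k, v: (B, N, d). Queries and
    keys are L2-normalized to keep 1 + q·k positive; the norm-preserving
    remainder correction of MB-TaylorFormer V2 is not included."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    B, N, d = q.shape
    kv = torch.einsum("bnd,bne->bde", k, v)          # sum_j k_j v_j^T  : (B, d, d)
    k_sum = k.sum(dim=1)                             # sum_j k_j       : (B, d)
    numer = v.sum(dim=1, keepdim=True) + torch.einsum("bnd,bde->bne", q, kv)
    denom = N + torch.einsum("bnd,bd->bn", q, k_sum)
    return numer / denom.unsqueeze(-1)

if __name__ == "__main__":
    q, k, v = (torch.randn(2, 1024, 32) for _ in range(3))
    exact = torch.softmax(F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(1, 2), -1) @ v
    approx = taylor_linear_attention(q, k, v)
    print((exact - approx).abs().mean().item())      # error of the first-order expansion
```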
https://arxiv.org/abs/2501.04486
Creating high-fidelity, coherent long videos is a sought-after aspiration. While recent video diffusion models have shown promising potential, they still grapple with spatiotemporal inconsistencies and high computational resource demands. We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing denoising trajectories through Global-Local Collaborative Denoising to ensure overall content consistency and temporal coherence between frames. Additionally, we introduce a Noise Reinitialization strategy which combines local noise shuffling with frequency fusion to improve global content consistency and visual diversity. Further, we propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses to enhance visual consistency and temporal smoothness. Extensive experiments, including quantitative and qualitative evaluations on videos of varying lengths (e.g., 3× and 6× longer), demonstrate that our method effectively integrates with existing video diffusion models, producing coherent, high-fidelity long videos superior to previous approaches.
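One building block of such tuning-free long-video pipelines, sketched under simplifying assumptions: run a pretrained short-clip noise predictor on overlapping temporal windows and average the overlapping predictions at each denoising step. GLC-Diffusion's specific global-local trajectories, noise reinitialization, and VMCR refinement are not reproduced.

```python
import torch

def windowed_noise_prediction(eps_model, z, t, window=16, stride=8):
    """Predict noise for a long latent video z of shape (B, C, T, H, W) by running
    a short-clip noise predictor on overlapping temporal windows and averaging
    the predictions where windows overlap."""
    B, C, T, H, W = z.shape
    assert T >= window, "latent video must be at least one window long"
    eps = torch.zeros_like(z)
    count = torch.zeros(1, 1, T, 1, 1)
    starts = list(range(0, T - window + 1, stride))
    if starts[-1] != T - window:
        starts.append(T - window)                       # make sure the tail is covered
    for s in starts:
        eps[:, :, s:s + window] += eps_model(z[:, :, s:s + window], t)
        count[:, :, s:s + window] += 1
    return eps / count

if __name__ == "__main__":
    model = lambda clip, t: clip * 0.1                  # stand-in short-video noise predictor
    z = torch.randn(1, 4, 48, 8, 8)                     # 3x longer than the 16-frame window
    print(windowed_noise_prediction(model, z, t=500).shape)   # (1, 4, 48, 8, 8)
```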
https://arxiv.org/abs/2501.05484
Generating molecular graphs is a challenging task due to their discrete nature and the competitive objectives involved. Diffusion models have emerged as SOTA approaches in data generation across various modalities. For molecular graphs, graph neural networks (GNNs) as a diffusion backbone have achieved impressive results. Latent space diffusion, where diffusion occurs in a low-dimensional space via an autoencoder, has demonstrated computational efficiency. However, the literature on latent space diffusion for molecular graphs is scarce, and no commonly accepted best practices exist. In this work, we explore different approaches and hyperparameters, contrasting generative flow models (denoising diffusion, flow matching, heat dissipation) and architectures (GNNs and E(3)-equivariant GNNs). Our experiments reveal a high sensitivity to the choice of approach and design decisions. Code is made available at this http URL.
https://arxiv.org/abs/2501.03696
Editability and fidelity are two essential demands for text-driven image editing: the editing area should align with the target prompt while the rest remains unchanged. Current cutting-edge editing methods usually follow an "inversion-then-editing" pipeline, where the source image is first inverted to an approximate Gaussian noise ${z}_T$, based on which a sampling process is conducted using the target prompt. Nevertheless, we argue that a near-Gaussian noise is a poor pivot for further editing, since it has lost almost all structural fidelity. We verify this with a pilot experiment, discovering that some intermediate inverted latents achieve a better trade-off between editability and fidelity than the fully inverted ${z}_T$. Based on this, we propose a novel editing paradigm dubbed ZZEdit, which gently strengthens the target guidance on a latent that is sufficient for editing while still preserving structure. Specifically, we locate such an editing pivot by searching for the first point on the inversion trajectory whose response to the target prompt is stronger than its response to the source prompt. Then, we propose a ZigZag process that performs mild target guiding at this pivot, alternating denoising and inversion steps so as to approach the target while still holding fidelity. Afterwards, to keep the numbers of inversion and denoising steps equal, we perform a pure sampling process under the target prompt. Extensive experiments highlight the effectiveness of ZZEdit in diverse image editing scenarios compared with the "inversion-then-editing" pipeline.
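The ZigZag stage can be sketched at a high level as alternating one target-conditioned denoising step with one inversion step around the chosen pivot, followed by plain sampling to t = 0. `denoise_step` and `invert_step` are hypothetical wrappers around a conditioned reverse step and a DDIM-style inversion step; schedules and prompt handling are omitted.

```python
import torch

def zigzag_guidance(z_pivot, t_pivot, denoise_step, invert_step, num_zigzags=5):
    """ZigZag-style mild target guidance around an intermediate inversion pivot:
    alternate one denoising step under the target prompt with one inversion step,
    nudging the latent toward the target while staying near the structure-
    preserving pivot, then finish with plain denoising down to t = 0.

    denoise_step(z, t) -> latent at t - 1 (target-conditioned reverse step)
    invert_step(z, t)  -> latent at t + 1 (DDIM-style inversion step)
    """
    z, t = z_pivot, t_pivot
    for _ in range(num_zigzags):
        z = denoise_step(z, t)        # move toward the target prompt
        z = invert_step(z, t - 1)     # step back up one noise level, keeping fidelity
    while t > 0:                      # final pure sampling under the target prompt
        z = denoise_step(z, t)
        t -= 1
    return z

if __name__ == "__main__":
    down = lambda z, t: 0.98 * z + 0.01    # stand-in conditioned denoising step
    up = lambda z, t: (z - 0.01) / 0.98    # stand-in inversion step
    z0 = zigzag_guidance(torch.randn(1, 4, 16, 16), 25, down, up)
    print(z0.shape)                        # torch.Size([1, 4, 16, 16])
```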
https://arxiv.org/abs/2501.03631