Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: this https URL
https://arxiv.org/abs/2506.09993
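As a rough illustration of the feedback loop described above, the sketch below runs a text-spotting head on a denoiser's internal features and feeds the recognized text back as the prompt for the next denoising step. All modules, shapes, and the prompt encoder are hypothetical stand-ins; the actual TeReDiff architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class DummyDenoiser(nn.Module):
    """Stand-in for a text-conditioned diffusion denoiser; returns (denoised, features)."""
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Conv2d(3, dim, 3, padding=1)
        self.out = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, x, prompt_emb):
        feats = torch.relu(self.body(x)) + prompt_emb.view(1, -1, 1, 1)
        return x - 0.1 * self.out(feats), feats

class DummyTextSpotter(nn.Module):
    """Stand-in for the text-spotting head trained jointly on diffusion features."""
    def __init__(self, dim=64, vocab=100):
        super().__init__()
        self.head = nn.Linear(dim, vocab)

    def forward(self, feats):
        pooled = feats.mean(dim=(2, 3))      # pool internal features over space
        return self.head(pooled).argmax(-1)  # ids of recognized text tokens

denoiser, spotter = DummyDenoiser(), DummyTextSpotter()
prompt_encoder = nn.Embedding(100, 64)       # hypothetical encoder for recognized text

x = torch.randn(1, 3, 64, 64)                # degraded input image
prompt = torch.zeros(64)                     # start from an empty prompt
for step in range(4):
    x, feats = denoiser(x, prompt)           # denoise under the current text prompt
    tokens = spotter(feats)                  # spot text from internal diffusion features
    prompt = prompt_encoder(tokens).mean(0)  # recognized text conditions the next step
```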
Diffusion models have recently emerged as a powerful approach for trajectory planning. However, their inherently non-sequential nature limits their effectiveness in long-horizon reasoning tasks at test time. The recently proposed Monte Carlo Tree Diffusion (MCTD) offers a promising solution by combining diffusion with tree-based search, achieving state-of-the-art performance on complex planning problems. Despite its strengths, our analysis shows that MCTD incurs substantial computational overhead due to the sequential nature of tree search and the cost of iterative denoising. To address this, we propose Fast-MCTD, a more efficient variant that preserves the strengths of MCTD while significantly improving its speed and scalability. Fast-MCTD integrates two techniques: Parallel MCTD, which enables parallel rollouts via delayed tree updates and redundancy-aware selection; and Sparse MCTD, which reduces rollout length through trajectory coarsening. Experiments show that Fast-MCTD achieves up to 100x speedup over standard MCTD while maintaining or improving planning performance. Remarkably, it even outperforms Diffuser in inference speed on some tasks, despite Diffuser requiring no search and yielding weaker solutions. These results position Fast-MCTD as a practical and scalable solution for diffusion-based inference-time reasoning.
https://arxiv.org/abs/2506.09498
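The following toy sketch illustrates the delayed-tree-update and redundancy-aware-selection ideas in isolation: a batch of rollouts is selected from a frozen snapshot of the tree statistics, evaluated in parallel, and only then backed up in one pass. The tree, reward function, and penalty weight are toy stand-ins, not the paper's planner.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

class Node:
    def __init__(self, name):
        self.name = name
        self.visits = 0
        self.value = 0.0

def score(parent, child, pending):
    # UCB with a redundancy penalty for children already selected in this batch.
    exploit = child.value / (child.visits + 1e-9)
    explore = math.sqrt(2 * math.log(parent.visits + 1) / (child.visits + 1e-9))
    return exploit + explore - 0.5 * pending.get(child.name, 0)

def rollout(node):
    return random.random()   # stand-in for a diffusion-based rollout evaluation

root = Node("root")
root.visits = 1
children = [Node(f"plan_{i}") for i in range(4)]

for _ in range(8):                              # 8 batches of parallel rollouts
    pending, batch = {}, []
    for _ in range(4):                          # select 4 nodes from frozen statistics
        child = max(children, key=lambda c: score(root, c, pending))
        pending[child.name] = pending.get(child.name, 0) + 1
        batch.append(child)
    with ThreadPoolExecutor() as pool:          # rollouts can run concurrently
        returns = list(pool.map(rollout, batch))
    for child, ret in zip(batch, returns):      # delayed tree update, applied once per batch
        child.visits += 1
        child.value += ret
        root.visits += 1
```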
In many complex scenarios, robotic manipulation relies on generative models to estimate the distribution of multiple successful actions. Because diffusion models offer better training robustness than other generative models, they perform well in imitation learning from successful robot demonstrations. However, diffusion-based policy methods typically require significant time to iteratively denoise robot actions, which hinders real-time responses in robotic manipulation. Moreover, existing diffusion policies model a time-varying action denoising process, whose temporal complexity increases the difficulty of model training and leads to suboptimal action accuracy. To generate robot actions efficiently and accurately, we present the Time-Unified Diffusion Policy (TUDP), which utilizes action recognition capabilities to build a time-unified denoising process. On the one hand, we build a time-unified velocity field in action space with additional action discrimination information. By unifying all timesteps of action denoising, our velocity field reduces the difficulty of policy learning and speeds up action generation. On the other hand, we propose an action-wise training method, which introduces an action discrimination branch to supply additional action discrimination information. Through action-wise training, TUDP implicitly learns to discern successful actions, improving denoising accuracy. Our method achieves state-of-the-art performance on RLBench, with the highest success rate of 82.6% in a multi-view setup and 83.8% in a single-view setup. In particular, when using fewer denoising iterations, TUDP achieves a more significant improvement in success rate. Additionally, TUDP can produce accurate actions for a wide range of real-world tasks.
https://arxiv.org/abs/2506.09422
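A minimal sketch of the time-unified idea, under the assumption that the velocity field takes no timestep input and is simply applied repeatedly to carry a noisy action toward a clean one. The network, action dimension, and step count below are illustrative, and the action-discrimination branch and its training are omitted.

```python
import torch
import torch.nn as nn

class UnifiedVelocityField(nn.Module):
    def __init__(self, act_dim=7, obs_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, action, obs):
        # Note: no timestep embedding -- the same field is used at every step.
        return self.net(torch.cat([action, obs], dim=-1))

@torch.no_grad()
def generate_action(model, obs, steps=5, step_size=0.2):
    action = torch.randn(obs.shape[0], 7)                 # start from pure noise
    for _ in range(steps):                                # few steps suffice if the
        action = action + step_size * model(action, obs)  # field is time-unified
    return action

model = UnifiedVelocityField()
obs = torch.randn(2, 32)                                  # batch of observation features
print(generate_action(model, obs).shape)                  # torch.Size([2, 7])
```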
We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling pretrained diffusion models into generative denoisers. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserving the benefits of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer NFEs than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.
https://arxiv.org/abs/2506.09416
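One standard way to make the stated insight concrete (a sketch, not the paper's derivation) is via Bayes' rule and Tweedie's formula, which tie the denoising posterior to the unconditional score:

```latex
% Assumes the variance-exploding kernel x_t = x_0 + sigma_t * eps, eps ~ N(0, I).
\begin{align}
  p(x_0 \mid x_t) &\propto p(x_0)\,\mathcal{N}\!\left(x_t;\, x_0,\, \sigma_t^2 I\right), \\
  \nabla_{x_0} \log p(x_0 \mid x_t)
    &= \nabla_{x_0} \log p(x_0) + \frac{x_t - x_0}{\sigma_t^2}, \\
  \mathbb{E}\!\left[x_0 \mid x_t\right] &= x_t + \sigma_t^2\, \nabla_{x_t} \log p_t(x_t).
\end{align}
```

In this view, a single unconditional score model already carries the information needed to characterize the denoising posterior at every noise level, which is the property the distillation builds on.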
Previous studies on event camera sensing have demonstrated promising detection performance using dense event representations. However, the accumulated noise in such dense representations has received insufficient attention, which degrades the representation quality and increases the likelihood of missed detections. To address this challenge, we propose the Wavelet Denoising-enhanced DEtection TRansformer, i.e., the WD-DETR network, for event cameras. In particular, a dense event representation is presented first, which enables real-time reconstruction of events as tensors. Then, a wavelet transform method is designed to filter noise in the event representations. This method is integrated into the backbone for feature extraction. The extracted features are subsequently fed into a transformer-based network for object prediction. To further reduce inference time, we incorporate the Dynamic Reorganization Convolution Block (DRCB) as a fusion module within the hybrid encoder. The proposed method has been evaluated on three event-based object detection datasets, i.e., DSEC, Gen1, and 1Mpx. The results demonstrate that WD-DETR outperforms the tested state-of-the-art methods. Additionally, we implement our approach on a common onboard computer for robots, the NVIDIA Jetson Orin NX, achieving a high frame rate of approximately 35 FPS using TensorRT FP16, which is exceptionally well-suited for real-time perception on onboard robotic systems.
https://arxiv.org/abs/2506.09098
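A hedged sketch of the wavelet-denoising step on a dense event frame, using PyWavelets with an illustrative wavelet, decomposition level, and universal-threshold rule; in the paper this filtering is integrated into the detection backbone rather than applied as a standalone pass.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_denoise(event_frame, wavelet="db4", level=2):
    """Soft-threshold the detail bands of a dense event frame (illustrative settings)."""
    coeffs = pywt.wavedec2(event_frame, wavelet, level=level)
    finest = coeffs[-1][-1]                               # finest diagonal detail band
    sigma = np.median(np.abs(finest)) / 0.6745            # robust noise estimate
    thr = sigma * np.sqrt(2 * np.log(event_frame.size))   # universal threshold
    denoised = [coeffs[0]]                                 # keep the approximation band
    for details in coeffs[1:]:
        denoised.append(tuple(pywt.threshold(d, thr, mode="soft") for d in details))
    return pywt.waverec2(denoised, wavelet)

frame = np.random.randn(128, 128).astype(np.float32)      # toy dense event frame
print(wavelet_denoise(frame).shape)                        # (128, 128)
```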
Multi-head self-attention (MHSA) has become a core component in modern computer vision models. However, its quadratic complexity with respect to input length poses a significant computational bottleneck in real-time and resource-constrained environments. We propose PnP-Nystra, a Nyström-based linear approximation of self-attention, developed as a plug-and-play (PnP) module that can be integrated into pre-trained image and video restoration models without retraining. As a drop-in replacement for MHSA, PnP-Nystra enables efficient acceleration in various window-based transformer architectures, including SwinIR, Uformer, and RVRT. Our experiments across diverse image and video restoration tasks, including denoising, deblurring, and super-resolution, demonstrate that PnP-Nystra achieves a 2-4x speed-up on an NVIDIA RTX 4090 GPU and a 2-5x speed-up on CPU inference. Despite these significant gains, the method incurs a maximum PSNR drop of only 1.5 dB across all evaluated tasks. To the best of our knowledge, we are the first to demonstrate linear attention functioning as a training-free substitute for MHSA in restoration models.
https://arxiv.org/abs/2506.08520
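For intuition, the sketch below shows a single-head Nyström-style linear approximation of softmax attention with segment-mean landmarks and a pseudo-inverse, in the spirit of a drop-in MHSA replacement; it is an illustrative approximation following the common Nyströmformer recipe, not the PnP-Nystra module itself.

```python
import torch

def nystrom_attention(q, k, v, num_landmarks=16):
    # q, k, v: (batch, seq_len, dim); seq_len divisible by num_landmarks in this sketch.
    b, n, d = q.shape
    scale = d ** -0.5
    q_l = q.reshape(b, num_landmarks, n // num_landmarks, d).mean(dim=2)  # landmark queries
    k_l = k.reshape(b, num_landmarks, n // num_landmarks, d).mean(dim=2)  # landmark keys

    kernel_1 = torch.softmax(q @ k_l.transpose(-1, -2) * scale, dim=-1)    # (b, n, m)
    kernel_2 = torch.softmax(q_l @ k_l.transpose(-1, -2) * scale, dim=-1)  # (b, m, m)
    kernel_3 = torch.softmax(q_l @ k.transpose(-1, -2) * scale, dim=-1)    # (b, m, n)

    # Never forms the full n x n attention map: cost scales with n * m instead of n^2.
    return kernel_1 @ torch.linalg.pinv(kernel_2) @ (kernel_3 @ v)

q = k = v = torch.randn(2, 256, 64)
print(nystrom_attention(q, k, v).shape)   # torch.Size([2, 256, 64])
```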
Safety validation of autonomous driving systems is extremely challenging due to the high risks and costs of real-world testing as well as the rarity and diversity of potential failures. To address these challenges, we train a denoising diffusion model to generate potential failure cases of an autonomous vehicle given any initial traffic state. Experiments on a four-way intersection problem show that in a variety of scenarios, the diffusion model can generate realistic failure samples while capturing a wide variety of potential failures. Our model does not require any external training dataset, can perform training and inference with modest computing resources, and does not assume any prior knowledge of the system under test, with applicability to safety validation for traffic intersections.
https://arxiv.org/abs/2506.08459
Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses the motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple fix to the I2V model sampling procedure that generates more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying low-pass filtering at the early stage of denoising. Extensive experiments demonstrate that ALG significantly improves the temporal dynamics of generated videos while preserving image fidelity and text alignment. In particular, on the VBench-I2V test suite, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
https://arxiv.org/abs/2506.08456
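A minimal sketch of the sampling-time fix, assuming a Gaussian blur whose strength decays over the early denoising steps and vanishes afterwards; the blur schedule, cutoff, and the placeholder `denoise_step` are illustrative choices rather than the paper's exact procedure.

```python
import torch
from torchvision.transforms.functional import gaussian_blur

def alg_condition(image, step, num_steps, cutoff_frac=0.3, max_sigma=3.0):
    """Return the conditioning image for this denoising step."""
    progress = step / max(num_steps - 1, 1)
    if progress >= cutoff_frac:
        return image                                      # late steps: full detail
    sigma = max_sigma * (1 - progress / cutoff_frac)      # blur decays toward the cutoff
    return gaussian_blur(image, kernel_size=9, sigma=sigma)

def denoise_step(latents, cond_image, step):              # placeholder I2V sampler step
    return latents - 0.01 * (latents - cond_image.mean())

image = torch.rand(1, 3, 64, 64)                          # reference frame for I2V generation
latents = torch.randn(1, 3, 64, 64)
num_steps = 50
for step in range(num_steps):
    cond = alg_condition(image, step, num_steps)          # adaptively low-passed condition
    latents = denoise_step(latents, cond, step)
```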
With the rapid development of text-to-vision generation diffusion models, classifier-free guidance has emerged as the most prevalent method for conditioning. However, this approach inherently requires twice as many model forward passes as unconditional generation, resulting in significantly higher costs. While a previous study introduced the concept of adaptive guidance, it lacked solid analysis and empirical results, so the method could not be applied to general diffusion models. In this work, we present another perspective on applying adaptive guidance and propose Step AG, a simple, universally applicable adaptive guidance strategy. Our evaluations focus on both image quality and image-text alignment; the results indicate that restricting classifier-free guidance to the first several denoising steps is sufficient for generating high-quality, well-conditioned images, achieving an average speedup of 20% to 30%. This improvement is consistent across different settings, such as the number of inference steps, and across various models, including video generation models, highlighting the superiority of our method.
https://arxiv.org/abs/2506.08351
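A sketch of the sampling loop implied by this strategy: classifier-free guidance (two forward passes) only for the first few denoising steps, a single conditional pass afterwards. The model, scheduler update, and guidance scale below are placeholders.

```python
import torch

def sample_step_ag(model, latents, text_emb, null_emb, num_steps=50,
                   guided_steps=10, guidance_scale=7.5):
    for step in range(num_steps):
        if step < guided_steps:
            # Early steps: full classifier-free guidance (2 forward passes).
            eps_cond = model(latents, text_emb, step)
            eps_uncond = model(latents, null_emb, step)
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        else:
            # Later steps: conditional pass only, roughly halving the cost here.
            eps = model(latents, text_emb, step)
        latents = latents - 0.02 * eps          # stand-in for the scheduler update
    return latents

# Toy stand-in for a conditional denoiser, to make the sketch runnable.
def toy_model(latents, emb, step):
    return latents * 0.1 + emb.mean()

out = sample_step_ag(toy_model, torch.randn(1, 4, 32, 32),
                     text_emb=torch.randn(77, 768), null_emb=torch.zeros(77, 768))
print(out.shape)
```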
Surgeons exhibit distinct operating styles due to differences in training, experience, and motor behavior - yet current AI systems often ignore this personalization signal. We propose a novel approach to model fine-grained, surgeon-specific fingerprinting in robotic surgery using a discrete diffusion framework integrated with a vision-language-action (VLA) pipeline. Our method formulates gesture prediction as a structured sequence denoising task, conditioned on multimodal inputs including endoscopic video, surgical intent language, and a privacy-aware embedding of surgeon identity and skill. Personalized surgeon fingerprinting is encoded through natural language prompts using third-party language models, allowing the model to retain individual behavioral style without exposing explicit identity. We evaluate our method on the JIGSAWS dataset and demonstrate that it accurately reconstructs gesture sequences while learning meaningful motion fingerprints unique to each surgeon. To quantify the privacy implications of personalization, we perform membership inference attacks and find that more expressive embeddings improve task performance but simultaneously increase susceptibility to identity leakage. These findings demonstrate that while personalized embeddings improve performance, they also increase vulnerability to identity leakage, revealing the importance of balancing personalization with privacy risk in surgical modeling. Code is available at: this https URL.
https://arxiv.org/abs/2506.08185
Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, though recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes image generators for latent regression, adapting a denoising framework with task encoding, per-task conditioning, and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient 1-to-N attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.
https://arxiv.org/abs/2506.08013
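One plausible reading of the 1-to-N task attention, sketched below: each task stream exchanges information only through a single shared stream rather than attending to every other stream, replacing N^2 pairwise interactions with N. The exact wiring in StableMTL may differ; dimensions and module choices are illustrative.

```python
import torch
import torch.nn as nn

class OneToNTaskAttention(nn.Module):
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.shared = nn.Parameter(torch.zeros(1, 1, dim))   # learned shared-stream token

    def forward(self, task_tokens):
        # task_tokens: (batch, num_tasks, dim) -- one pooled token per task stream.
        b = task_tokens.shape[0]
        shared = self.shared.expand(b, -1, -1)
        shared, _ = self.gather(shared, task_tokens, task_tokens)   # shared <- all tasks
        updates, _ = self.scatter(task_tokens, shared, shared)      # each task <- shared
        return task_tokens + updates                                # N attention ops, not N^2

x = torch.randn(2, 7, 64)                  # 7 task streams, as in the 7-task setting
print(OneToNTaskAttention()(x).shape)      # torch.Size([2, 7, 64])
```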
Recent work on diffusion models proposed that they operate in two regimes: memorization, in which models reproduce their training data, and generalization, in which they generate novel samples. While this has been tested in high-noise settings, the behavior of diffusion models as effective denoisers when the corruption level is small remains unclear. To address this gap, we systematically investigated the behavior of diffusion models under low-noise diffusion dynamics, with implications for model robustness and interpretability. Using (i) CelebA subsets of varying sample sizes and (ii) analytic Gaussian mixture benchmarks, we reveal that models trained on disjoint data diverge near the data manifold even when their high-noise outputs converge. We quantify how training set size, data geometry, and model objective choice shape denoising trajectories and affect score accuracy, providing insights into how these models actually learn representations of data distributions. This work starts to address gaps in our understanding of generative model reliability in practical applications where small perturbations are common.
https://arxiv.org/abs/2506.07841
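For the analytic Gaussian-mixture benchmark mentioned above, the noised marginal and its score have a standard closed form against which learned scores can be checked at small noise levels (stated here as a reference identity, not in the paper's notation):

```latex
% Closed form for a Gaussian mixture p_0(x) = sum_k w_k N(x; mu_k, Sigma_k)
% corrupted with isotropic noise of scale sigma_t (standard identity).
\begin{align}
  p_t(x) &= \sum_k w_k\, \mathcal{N}\!\left(x;\ \mu_k,\ \Sigma_k + \sigma_t^2 I\right), \\
  \nabla_x \log p_t(x) &= \sum_k r_k(x)\, \left(\Sigma_k + \sigma_t^2 I\right)^{-1}\!\left(\mu_k - x\right),
  \quad
  r_k(x) = \frac{w_k\, \mathcal{N}\!\left(x;\ \mu_k,\ \Sigma_k + \sigma_t^2 I\right)}
                {\sum_j w_j\, \mathcal{N}\!\left(x;\ \mu_j,\ \Sigma_j + \sigma_t^2 I\right)}.
\end{align}
```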
Inference in dynamic probabilistic models is a complex task involving expensive operations. In particular, for Hidden Markov Models, the whole state space has to be enumerated for advancing in time. Even states with negligible probabilities are considered, resulting in computational inefficiency and increased noise due to the propagation of unlikely probability mass. We propose to denoise the future and speed up inference by using only the top-p states, i.e., the most probable states with accumulated probability p. We show that the error introduced by using only the top-p states is bound by p and the so-called minimal mixing rate of the underlying model. Moreover, in our empirical evaluation, we show that we can expect speedups of at least an order of magnitude, while the error in terms of total variation distance is below 0.09.
https://arxiv.org/abs/2506.07578
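A small NumPy sketch of the top-p forward pass: at each time step, only the most probable states whose cumulative probability reaches p are kept and renormalized before advancing. The HMM parameters below are toy values; the error bound from the paper is not reproduced.

```python
import numpy as np

def forward_top_p(init, trans, emit, observations, p=0.9):
    """init: (S,), trans: (S, S) row-stochastic, emit: (S, O), observations: list of ints."""
    belief = init * emit[:, observations[0]]
    belief /= belief.sum()
    for obs in observations[1:]:
        # Keep the smallest set of states with accumulated probability >= p.
        order = np.argsort(belief)[::-1]
        cumulative = np.cumsum(belief[order])
        keep = order[: int(np.searchsorted(cumulative, p)) + 1]
        pruned = np.zeros_like(belief)
        pruned[keep] = belief[keep]
        pruned /= pruned.sum()
        # Advance time using only the surviving probability mass.
        belief = (pruned @ trans) * emit[:, obs]
        belief /= belief.sum()
    return belief

rng = np.random.default_rng(0)
S, O = 50, 10
trans = rng.dirichlet(np.ones(S), size=S)      # transition matrix
emit = rng.dirichlet(np.ones(O), size=S)       # per-state emission distribution
init = np.full(S, 1.0 / S)
obs_seq = rng.integers(0, O, size=20).tolist()
print(forward_top_p(init, trans, emit, obs_seq).round(3))
```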
We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. By extending the Retrieval-Augmented Generative Error Correction (RAGEC) framework, DeRAGEC employs synthetic denoising rationales to filter out noisy NE candidates before correction. By leveraging phonetic similarity and augmented definitions, it refines noisy retrieved NEs using in-context learning, requiring no additional training. Experimental results on CommonVoice and STOP datasets show significant improvements in Word Error Rate (WER) and NE hit ratio, outperforming baseline ASR and RAGEC methods. Specifically, we achieved a 28% relative reduction in WER compared to ASR without postprocessing. Our source code is publicly available at: this https URL
https://arxiv.org/abs/2506.07510
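As a rough illustration of the candidate-filtering step, the sketch below ranks retrieved named-entity candidates by string similarity to the ASR hypothesis and keeps the top few before any correction prompt is built. Real phonetic similarity (e.g., over phoneme sequences), the synthetic denoising rationales, and the in-context correction itself are not reproduced; `difflib` is only a stand-in.

```python
from difflib import SequenceMatcher

def filter_candidates(asr_span, candidates, keep=3):
    """Keep the candidates most similar to the (possibly misrecognized) ASR span."""
    scored = [(SequenceMatcher(None, asr_span.lower(), c.lower()).ratio(), c)
              for c in candidates]
    scored.sort(reverse=True)
    return [name for _, name in scored[:keep]]

# Toy example: noisy retrieval results for a misrecognized entity.
candidates = ["Joe Biden", "Joe Burrow", "John Bolton", "Hoboken", "Jill Biden"]
print(filter_candidates("joe bidan", candidates))
```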
Diffusion models have shown remarkable flexibility for solving inverse problems without task-specific retraining. However, existing approaches such as Manifold Preserving Guided Diffusion (MPGD) apply only a single gradient update per denoising step, limiting restoration fidelity and robustness, especially in embedded or out-of-distribution settings. In this work, we introduce a multistep optimization strategy within each denoising timestep, significantly enhancing image quality, perceptual accuracy, and generalization. Our experiments on super-resolution and Gaussian deblurring demonstrate that increasing the number of gradient updates per step improves LPIPS and PSNR with minimal latency overhead. Notably, we validate this approach on a Jetson Orin Nano using degraded ImageNet and a UAV dataset, showing that MPGD, originally trained on face datasets, generalizes effectively to natural and aerial scenes. Our findings highlight MPGD's potential as a lightweight, plug-and-play restoration module for real-time visual perception in embodied AI agents such as drones and mobile robots.
https://arxiv.org/abs/2506.07286
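A hedged sketch of the multistep strategy: within each denoising timestep, several gradient updates on a measurement-consistency loss are applied to the current clean estimate before the trajectory continues. The denoiser, forward operator, step sizes, and update rule below are placeholders standing in for MPGD-style guidance.

```python
import torch

def guided_denoising(x_T, y, A, denoise, num_steps=20, inner_updates=4, lr=0.1):
    x = x_T.clone()
    for t in reversed(range(num_steps)):
        x0_hat = denoise(x, t)                          # model's current clean estimate
        for _ in range(inner_updates):                  # >1 gradient update per timestep
            x0_hat = x0_hat.detach().requires_grad_(True)
            loss = ((A(x0_hat) - y) ** 2).mean()        # data consistency on the estimate
            grad, = torch.autograd.grad(loss, x0_hat)
            x0_hat = x0_hat - lr * grad
        x = 0.9 * x + 0.1 * x0_hat.detach()             # stand-in for the sampler update
    return x

# Toy super-resolution setup: A downsamples, `denoise` is a no-op placeholder.
A = lambda img: torch.nn.functional.avg_pool2d(img, 4)
denoise = lambda img, t: img
y = torch.rand(1, 3, 16, 16)                            # low-resolution measurement
x_T = torch.randn(1, 3, 64, 64)
print(guided_denoising(x_T, y, A, denoise).shape)       # torch.Size([1, 3, 64, 64])
```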
Face recognition using 3D point clouds is gaining growing interest, yet raw point clouds often contain a significant amount of noise due to imperfect sensors. In this paper, an end-to-end 3D face recognition method for noisy point clouds is proposed, which synergistically integrates the denoising and recognition modules. Specifically, a Conditional Generative Adversarial Network on Three Orthogonal Planes (cGAN-TOP) is designed to effectively remove the noise in the point cloud and recover the underlying features for subsequent recognition. A Linked Dynamic Graph Convolutional Neural Network (LDGCNN) is then adapted to recognize faces from the processed point cloud, hierarchically linking both the local point features and neighboring features of multiple scales. The proposed method is validated on the Bosphorus dataset. It significantly improves the recognition accuracy under all noise settings, with a maximum gain of 14.81%.
https://arxiv.org/abs/2506.06864
The main contribution of this paper is an iterative procedure for tubular structure segmentation in 2D images, which combines a tight frame of curvelet transforms with a thresholding technique based on minimizing Stein's Unbiased Risk Estimate (SURE) for denoising. The proposed algorithm is mainly based on the TFA proposal presented in [1, 9]; we use the eigenvectors of the image's Hessian matrix to improve the iterative part, segmenting unclear and narrow vessels and filling the gaps between separate pieces of detected vessels. Experimental results are presented to demonstrate the effectiveness of the proposed model.
https://arxiv.org/abs/1412.8656
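As a point of reference for the SURE-based thresholding, the sketch below selects a soft-threshold by minimizing Stein's Unbiased Risk Estimate over candidate thresholds (the classical SureShrink rule), assuming coefficients normalized to unit noise variance; the curvelet transform and the Hessian-based vessel step are not included.

```python
import numpy as np

def sure_soft_threshold(coeffs):
    """Pick the soft-threshold minimizing Stein's Unbiased Risk Estimate (unit variance)."""
    x = np.abs(coeffs.ravel())
    n = x.size
    candidates = np.sort(x)
    risks = [n - 2 * np.sum(x <= t) + np.sum(np.minimum(x, t) ** 2) for t in candidates]
    return candidates[int(np.argmin(risks))]

def soft(coeffs, t):
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

rng = np.random.default_rng(0)
signal = np.zeros(256)
signal[::32] = 5.0                                  # sparse "coefficients"
noisy = signal + rng.normal(size=256)               # unit-variance noise
t = sure_soft_threshold(noisy)
print(f"chosen threshold: {t:.3f}")
denoised = soft(noisy, t)
```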
Optimizing complex systems, from discovering therapeutic drugs to designing high-performance materials, remains a fundamental challenge across science and engineering, as the underlying rules are often unknown and costly to evaluate. Offline optimization aims to optimize designs for target scores using pre-collected datasets without system interaction. However, conventional approaches may fail beyond training data, predicting inaccurate scores and generating inferior designs. This paper introduces ManGO, a diffusion-based framework that learns the design-score manifold, capturing the design-score interdependencies holistically. Unlike existing methods that treat design and score spaces in isolation, ManGO unifies forward prediction and backward generation, attaining generalization beyond training data. Key to this is its derivative-free guidance for conditional generation, coupled with adaptive inference-time scaling that dynamically optimizes denoising paths. Extensive evaluations demonstrate that ManGO outperforms 24 single- and 10 multi-objective optimization methods across diverse domains, including synthetic tasks, robot control, material design, DNA sequence, and real-world engineering optimization.
https://arxiv.org/abs/2506.05680
Computer vision is largely based on 2D techniques, with 3D vision still relegated to a relatively narrow subset of applications. However, by building on recent advances in 3D models such as neural radiance fields, some authors have shown that 3D techniques can at last improve outputs extracted from independent 2D views, by fusing them into 3D and denoising them. This is particularly helpful in egocentric videos, where the camera motion is significant, but only under the assumption that the scene itself is static. In fact, as shown in the recent analysis conducted by EPIC Fields, 3D techniques are ineffective when it comes to studying dynamic phenomena, and, in particular, when segmenting moving objects. In this paper, we look into this issue in more detail. First, we propose to improve dynamic segmentation in 3D by fusing motion segmentation predictions from a 2D-based model into layered radiance fields (Layered Motion Fusion). However, the high complexity of long, dynamic videos makes it challenging to capture the underlying geometric structure, and, as a result, hinders the fusion of motion cues into the (incomplete) scene geometry. We address this issue through test-time refinement, which helps the model to focus on specific frames, thereby reducing the data complexity. This results in a synergy between motion fusion and the refinement, and in turn leads to segmentation predictions of the 3D model that surpass the 2D baseline by a large margin. This demonstrates that 3D techniques can enhance 2D analysis even for dynamic phenomena in a challenging and realistic setting.
https://arxiv.org/abs/2506.05546
High-resolution (HR) videos play a crucial role in many computer vision applications. Although existing video restoration (VR) methods can significantly enhance video quality by exploiting temporal information across video frames, they are typically trained for fixed upscaling factors and lack the flexibility to handle scales or degradations beyond their training distribution. In this paper, we introduce VR-INR, a novel video restoration approach based on Implicit Neural Representations (INRs) that is trained only on a single upscaling factor ($\times 4$) but generalizes effectively to arbitrary, unseen super-resolution scales at test time. Notably, VR-INR also performs zero-shot denoising on noisy input, despite never having seen noisy data during training. Our method employs a hierarchical spatial-temporal-texture encoding framework coupled with multi-resolution implicit hash encoding, enabling adaptive decoding of high-resolution and noise-suppressed frames from low-resolution inputs at any desired magnification. Experimental results show that VR-INR consistently maintains high-quality reconstructions at unseen scales and noise during training, significantly outperforming state-of-the-art approaches in sharpness, detail preservation, and denoising efficacy.
https://arxiv.org/abs/2506.05488
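A simplified sketch of arbitrary-scale decoding from an implicit representation: a coordinate MLP (standing in for the paper's hierarchical, hash-encoded spatial-temporal-texture model) is queried on a coordinate grid of any target resolution. The temporal and texture branches and the zero-shot denoising behavior are omitted.

```python
import torch
import torch.nn as nn

class CoordinateINR(nn.Module):
    def __init__(self, hidden=128, out_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_channels),
        )

    def forward(self, coords):                    # coords: (N, 2) in [0, 1]^2
        return self.net(coords)

@torch.no_grad()
def render_at_scale(model, height, width):
    ys = torch.linspace(0, 1, height)
    xs = torch.linspace(0, 1, width)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1).reshape(-1, 2)
    rgb = model(grid).reshape(height, width, 3)
    return rgb.permute(2, 0, 1)                   # (3, H, W) at the requested scale

model = CoordinateINR()                           # would be fit to low-resolution frames
print(render_at_scale(model, 256, 256).shape)     # any unseen scale at test time
print(render_at_scale(model, 384, 512).shape)
```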