How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, at a cost of only a few hundred dollars per problem.
https://arxiv.org/abs/2601.16175
Keyword Spotting (KWS) systems with small-footprint models deployed on edge devices face significant accuracy and robustness challenges due to domain shifts caused by varying noise and recording conditions. To address this, we propose a comprehensive continual learning framework designed to adapt to new domains while maintaining computational efficiency. The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features, supported by a multi-stage denoising process involving discrete wavelet transform and spectral subtraction techniques, plus model and prototype update blocks. Unlike prior methods that restrict updates to specific layers, our approach updates the complete quantized model, made possible by the compact model architecture. A subset of input samples is selected during runtime using class prototypes and confidence-driven filtering; these samples are then pseudo-labeled and combined with a rehearsal buffer for incremental model retraining. Experimental results on a noisy test dataset demonstrate the framework's effectiveness, achieving 99.63\% accuracy on clean data and maintaining robust performance (exceeding 94\% accuracy) across diverse noisy environments, even at -10 dB Signal-to-Noise Ratio. The proposed framework confirms that integrating efficient denoising with prototype-based continual learning enables KWS models to operate autonomously and robustly in resource-constrained, dynamic environments.
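The runtime selection step can be sketched as follows; the thresholds, the embedding space, and the way prototypes are compared are illustrative assumptions on our part, not the paper's implementation:

```python
import numpy as np

def select_for_pseudolabeling(embeddings, logits, prototypes,
                              dist_thresh=1.0, conf_thresh=0.9):
    """Pick runtime samples that are both close to the prototype of their
    predicted class AND predicted with high confidence; return their indices
    and pseudo-labels for incremental retraining."""
    # softmax confidence per sample (numerically stabilized)
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    # distance of each sample to the prototype of its predicted class
    dists = np.linalg.norm(embeddings - prototypes[pred], axis=1)
    keep = (conf >= conf_thresh) & (dists <= dist_thresh)
    return np.where(keep)[0], pred[keep]
```

Selected samples would then be mixed with the rehearsal buffer before the retraining pass described above.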
https://arxiv.org/abs/2601.16158
Diffusion models achieve state-of-the-art performance but often fail to generate outputs that align with human preferences and intentions, resulting in images with poor aesthetic quality and semantic inconsistencies. Existing alignment methods present a difficult trade-off: fine-tuning approaches suffer from loss of diversity with reward over-optimization, while test-time scaling methods introduce significant computational overhead and tend to under-optimize. To address these limitations, we propose HyperAlign, a novel framework that trains a hypernetwork for efficient and effective test-time alignment. Instead of modifying latent states, HyperAlign dynamically generates low-rank adaptation weights to modulate the diffusion model's generation operators. This allows the denoising trajectory to be adaptively adjusted based on input latents, timesteps and prompts for reward-conditioned alignment. We introduce multiple variants of HyperAlign that differ in how frequently the hypernetwork is applied, balancing between performance and efficiency. Furthermore, we optimize the hypernetwork using a reward score objective regularized with preference data to reduce reward hacking. We evaluate HyperAlign on multiple extended generative paradigms, including Stable Diffusion and FLUX. It significantly outperforms existing fine-tuning and test-time scaling baselines in enhancing semantic consistency and visual appeal.
https://arxiv.org/abs/2601.15968
In this paper we propose the Iterative Amortized Hierarchical Variational Autoencoder (IA-HVAE), which expands on amortized inference with a hybrid scheme containing an initial amortized guess and iterative refinement with decoder gradients. We achieve this by creating a linearly separable decoder in a transform domain (e.g. Fourier space), enabling real-time applications with very high model depths. The architectural change leads to a 35x speed-up for iterative inference with respect to the traditional HVAE. We show that our hybrid approach outperforms fully amortized and fully iterative equivalents in accuracy and speed respectively. Moreover, the IA-HVAE shows improved reconstruction quality over a vanilla HVAE in inverse problems such as deblurring and denoising.
https://arxiv.org/abs/2601.15894
Diffusion models have emerged as a powerful approach for multimodal motion planning in autonomous driving. However, their practical deployment is typically hindered by the inherent difficulty in enforcing vehicle dynamics and a critical reliance on accurate predictions of other agents, making them prone to safety issues under uncertain interactions. To address these limitations, we introduce DualShield, a planning and control framework that leverages Hamilton-Jacobi (HJ) reachability value functions in a dual capacity. First, the value functions act as proactive guidance, steering the diffusion denoising process towards safe and dynamically feasible regions. Second, they form a reactive safety shield using control barrier-value functions (CBVFs) to modify the executed actions and ensure safety. This dual mechanism preserves the rich exploration capabilities of diffusion models while providing principled safety assurance under uncertain and even adversarial interactions. Simulations in challenging unprotected U-turn scenarios demonstrate that DualShield significantly improves both safety and task efficiency compared to leading methods from different planning paradigms under uncertainty.
https://arxiv.org/abs/2601.15729
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in high-fidelity image generation. However, evaluating their semantic controllability, specifically for fine-grained, single-domain tasks, remains challenging. Standard metrics like FID and Inception Score (IS) often fail to detect identity misalignment in such specialized contexts. In this work, we investigate Class-Conditional DDPMs for K-pop idol face generation (32x32), a domain characterized by high inter-class similarity. We propose a calibrated metric, Relative Classification Accuracy (RCA), which normalizes generative performance against an oracle classifier's baseline. Our evaluation reveals a critical trade-off: while the model achieves high visual quality (FID 8.93), it suffers from severe semantic mode collapse (RCA 0.27), particularly for visually ambiguous identities. We analyze these failure modes through confusion matrices and attribute them to resolution constraints and intra-gender ambiguity. Our framework provides a rigorous standard for verifying identity consistency in conditional generative models.
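One plausible reading of the RCA metric, as a ratio of the oracle classifier's accuracy on class-conditional generations to its accuracy on real held-out data, can be sketched as below; the exact normalization is our assumption from the abstract, not the paper's formula:

```python
import numpy as np

def rca(oracle_pred_gen, cond_labels_gen, oracle_pred_real, labels_real):
    """Relative Classification Accuracy: oracle accuracy on generated samples
    (checked against their conditioning labels), normalized by the same
    oracle's accuracy on real data."""
    acc_gen = np.mean(np.asarray(oracle_pred_gen) == np.asarray(cond_labels_gen))
    acc_real = np.mean(np.asarray(oracle_pred_real) == np.asarray(labels_real))
    return float(acc_gen / acc_real)
```

Under this reading, RCA near 1 means generations are as identifiable as real images, while the reported 0.27 indicates severe semantic collapse despite a strong FID.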
https://arxiv.org/abs/2601.15560
Deep learning models, particularly Transformers, are often criticized as "black boxes" and lack interpretability. We propose Prism, a white-box attention-based architecture derived from the principles of Maximizing Coding Rate Reduction ($\text{MCR}^2$). By modeling the attention mechanism as a gradient ascent process on a distinct signal-noise manifold, we introduce two physical constraints: an overcomplete dictionary to expand the representational phase space, and an irrational frequency separation ($\pi$-RoPE) to enforce incoherence between signal and noise subspaces. We demonstrate that these geometric inductive biases can be viewed as a physical constraint and they are sufficient to induce unsupervised functional disentanglement alone. Using TinyStories as a controlled testbed for verifying spectral dynamics, we observe that Prism spontaneously specializes its attention heads into spectrally distinct regimes: low-frequency heads capturing long-range causal dependencies (signal) and high-frequency heads handling local syntactic constraints (noise). Our results suggest that interpretability and performance are not a trade-off, but can be unified through principled geometric construction.
https://arxiv.org/abs/2601.15540
In recent years, Rectified flow (RF) has gained considerable popularity largely due to its generation efficiency and state-of-the-art performance. In this paper, we investigate the degree to which RF automatically adapts to the intrinsic low dimensionality of the support of the target distribution to accelerate sampling. We show that, using a carefully designed choice of the time-discretization scheme and with sufficiently accurate drift estimates, the RF sampler enjoys an iteration complexity of order $O(k/\varepsilon)$ (up to log factors), where $\varepsilon$ is the precision in total variation distance and $k$ is the intrinsic dimension of the target distribution. In addition, we show that the denoising diffusion probabilistic model (DDPM) procedure is equivalent to a stochastic version of RF by establishing a novel connection between these processes and stochastic localization. Building on this connection, we further design a stochastic RF sampler that also adapts to the low dimensionality of the target distribution under milder requirements on the accuracy of the drift estimates, again with a specific time schedule. Through simulations on synthetic data and text-to-image experiments, we illustrate the improved performance of the proposed samplers implementing the newly designed time-discretization schedules.
https://arxiv.org/abs/2601.15500
Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at this https URL
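The critic-in-the-loop refinement loop can be sketched as below; `generate` and `critique` are placeholder callables standing in for the T2I model and the vision-language critic, and the append-feedback-to-prompt format is our illustrative assumption, not the paper's protocol:

```python
def refine_loop(prompt, generate, critique, max_steps=4):
    """Iterative test-time refinement: generate an image, ask the VLM critic
    what constraint is still violated, and regenerate with the accumulated
    feedback folded into the prompt. Stops early once the critic is satisfied
    (empty feedback)."""
    image = generate(prompt)
    feedback_log = []
    for _ in range(max_steps):
        feedback = critique(prompt, image)   # e.g. "missing: red cube"
        if not feedback:                     # critic satisfied -> stop
            break
        feedback_log.append(feedback)
        image = generate(prompt + " | fix: " + "; ".join(feedback_log))
    return image
```

Unlike compute-matched parallel sampling, each step here conditions on a diagnosis of the previous failure, which is what lets complex prompts be satisfied as a sequence of corrections.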
https://arxiv.org/abs/2601.15286
Traditional defenses against Deep Leakage (DL) attacks in Federated Learning (FL) primarily focus on obfuscation, introducing noise, transformations, or encryption to degrade an attacker's ability to reconstruct private data. While effective to some extent, these methods often still leak high-level information such as class distributions or feature representations, and are frequently broken by increasingly powerful denoising attacks. We propose a fundamentally different perspective on FL defense: framing it as a spoofing problem. We introduce SpooFL (Figure 1), a spoofing-based defense that deceives attackers into believing they have recovered the true training data, while actually providing convincing but entirely synthetic samples from an unrelated task. Unlike prior synthetic-data defenses that share classes or distributions with the private data and thus still leak semantic information, SpooFL uses a state-of-the-art generative model trained on an external dataset with no class overlap. As a result, attackers are misled into recovering plausible yet completely irrelevant samples, preventing meaningful data leakage while preserving FL training integrity. We implement the first example of such a spoofing defense, evaluate it against state-of-the-art DL defenses, and demonstrate that it successfully misdirects attackers without significantly compromising model performance.
https://arxiv.org/abs/2601.15055
Diffusion models have seen widespread adoption for text-driven human motion generation and related tasks due to their impressive generative capabilities and flexibility. However, current motion diffusion models face two major limitations: a representational gap caused by pre-trained text encoders that lack motion-specific information, and error propagation during the iterative denoising process. This paper introduces Reconstruction-Anchored Diffusion Model (RAM) to address these challenges. First, RAM leverages a motion latent space as intermediate supervision for text-to-motion generation. To this end, RAM co-trains a motion reconstruction branch with two key objective functions: self-regularization to enhance the discrimination of the motion space and motion-centric latent alignment to enable accurate mapping from text to the motion latent space. Second, we propose Reconstructive Error Guidance (REG), a testing-stage guidance mechanism that exploits the diffusion model's inherent self-correction ability to mitigate error propagation. At each denoising step, REG uses the motion reconstruction branch to reconstruct the previous estimate, reproducing the prior error patterns. By amplifying the residual between the current prediction and the reconstructed estimate, REG highlights the improvements in the current prediction. Extensive experiments demonstrate that RAM achieves significant improvements and state-of-the-art performance. Our code will be released.
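As we read the abstract, the REG update extrapolates along the residual between the current prediction and the reconstruction of the previous estimate, in the spirit of guidance-by-extrapolation; the exact update rule, the `reconstruct` callable, and the `scale` value below are our illustrative assumptions:

```python
import numpy as np

def reg_guidance(current_pred, prev_estimate, reconstruct, scale=1.5):
    """Reconstructive Error Guidance (sketch): rebuild the previous denoising
    estimate with the motion-reconstruction branch (reproducing its error
    pattern), then amplify the residual between the current prediction and
    that reconstruction to highlight the step's improvement."""
    recon_prev = reconstruct(prev_estimate)
    return current_pred + scale * (current_pred - recon_prev)
```

With `scale=0` this reduces to the plain prediction; larger values push the trajectory further away from the previous step's error pattern.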
https://arxiv.org/abs/2601.14788
Artificial Intelligence-Generated Content (AIGC) has made significant strides, with high-resolution text-to-image (T2I) generation becoming increasingly critical for improving users' Quality of Experience (QoE). Although resource-constrained edge computing adequately supports fast low-resolution T2I generation, achieving high-resolution output still faces the challenge of preserving image fidelity without incurring high latency. To address this, we first investigate the performance of super-resolution (SR) methods for image enhancement, confirming a fundamental trade-off: lightweight learning-based SR struggles to recover fine details, while diffusion-based SR achieves higher fidelity at a substantial computational cost. Motivated by these observations, we propose an end-edge collaborative generation-enhancement framework. Upon receiving a T2I generation task, the system first generates a low-resolution image at the edge based on adaptively selected denoising steps and super-resolution scales; the image is then partitioned into patches and processed by a region-aware hybrid SR policy. This policy applies a diffusion-based SR model to foreground patches for detail recovery and a lightweight learning-based SR model to background patches for efficient upscaling, ultimately stitching the enhanced patches into the high-resolution image. Experiments show that our system reduces service latency by 33% compared with baselines while maintaining competitive image quality.
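The region-aware dispatch can be sketched as below; the two SR callables, the fixed 2x scale, and the "majority of foreground pixels" test are illustrative assumptions, not the paper's policy:

```python
import numpy as np

def hybrid_sr(image, fg_mask, sr_diffusion, sr_light, patch=8):
    """Region-aware hybrid SR dispatch: apply diffusion-based SR to foreground
    patches and a lightweight SR model to background patches, then stitch the
    upscaled patches into the full high-resolution image."""
    h, w = image.shape
    scale = 2  # fixed upscale factor for this sketch
    out = np.zeros((h * scale, w * scale))
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            p = image[y:y + patch, x:x + patch]
            is_fg = fg_mask[y:y + patch, x:x + patch].mean() > 0.5
            up = sr_diffusion(p) if is_fg else sr_light(p)
            out[y * scale:(y + p.shape[0]) * scale,
                x * scale:(x + p.shape[1]) * scale] = up
    return out
```

The latency saving comes from reserving the expensive diffusion path for the (usually small) foreground region.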
https://arxiv.org/abs/2601.14741
Offline black-box optimization (BBO) aims to find optimal designs based solely on an offline dataset of designs and their labels. Such scenarios frequently arise in domains like DNA sequence design and robotics, where only a few labeled data points are available. Traditional methods typically rely on task-specific proxy or generative models, overlooking the in-context learning capabilities of pre-trained large language models (LLMs). Recent efforts have adapted autoregressive LLMs to BBO by framing task descriptions and offline datasets as natural language prompts, enabling direct design generation. However, these designs often contain bidirectional dependencies, which left-to-right models struggle to capture. In this paper, we explore diffusion LLMs for BBO, leveraging their bidirectional modeling and iterative refinement capabilities. This motivates our in-context denoising module: we condition the diffusion LLM on the task description and the offline dataset, both formatted in natural language, and prompt it to denoise masked designs into improved candidates. To guide the generation toward high-performing designs, we introduce masked diffusion tree search, which casts the denoising process as a step-wise Monte Carlo Tree Search that dynamically balances exploration and exploitation. Each node represents a partially masked design, each denoising step is an action, and candidates are evaluated via expected improvement under a Gaussian Process trained on the offline dataset. Our method, dLLM, achieves state-of-the-art results in few-shot settings on design-bench.
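The node-scoring rule named in the abstract, expected improvement under a Gaussian Process posterior, has a standard closed form that can be sketched as below (the GP fitting itself is assumed done upstream):

```python
import math

def expected_improvement(mu, sigma, best_so_far):
    """Expected improvement (maximization) of a candidate whose value has GP
    posterior N(mu, sigma^2), relative to the best observed value:
    EI = (mu - f*) * Phi(z) + sigma * phi(z), with z = (mu - f*) / sigma."""
    if sigma <= 0:
        return max(mu - best_so_far, 0.0)  # degenerate posterior
    z = (mu - best_so_far) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)  # std normal pdf
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))         # std normal cdf
    return (mu - best_so_far) * Phi + sigma * phi
```

In the tree search, each partially masked design would be scored this way, trading off posterior mean (exploitation) against posterior spread (exploration).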
https://arxiv.org/abs/2601.14446
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.
https://arxiv.org/abs/2601.14234
Self-supervised learning is increasingly investigated for low-dose computed tomography (LDCT) image denoising, as it alleviates the dependence on paired normal-dose CT (NDCT) data, which are often difficult to acquire in clinical practice. In this paper, we propose a novel self-supervised training strategy that relies exclusively on LDCT images. We introduce a step-wise blind-spot denoising mechanism that enforces conditional independence in a progressive manner, enabling more fine-grained denoising learning. In addition, we add Gaussian noise to LDCT images, which acts as a regularization and mitigates overfitting. Extensive experiments on the Mayo LDCT dataset demonstrate that the proposed method consistently outperforms existing self-supervised approaches and achieves performance comparable to, or better than, several representative supervised denoising methods.
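A generic blind-spot input construction (in the Noise2Void style) can be sketched as below; this illustrates the blind-spot idea only, not the paper's step-wise schedule, and the masking fraction and neighbor-replacement rule are assumptions:

```python
import numpy as np

def blind_spot_batch(image, mask_frac=0.02, rng=None):
    """Build a blind-spot training input: hide a random fraction of pixels by
    replacing each with a randomly chosen neighbor's value. The network is
    then trained to predict the hidden pixels, which it can only do from
    context, enforcing the conditional independence the method relies on."""
    rng = np.random.default_rng(rng)
    img = image.copy()
    h, w = image.shape
    n = max(1, int(mask_frac * h * w))
    ys = rng.integers(0, h, n)
    xs = rng.integers(0, w, n)
    # offsets into the 3x3 neighborhood (may be the pixel itself)
    ny = np.clip(ys + rng.integers(-1, 2, n), 0, h - 1)
    nx = np.clip(xs + rng.integers(-1, 2, n), 0, w - 1)
    img[ys, xs] = image[ny, nx]
    mask = np.zeros_like(image, dtype=bool)
    mask[ys, xs] = True
    return img, mask
```

The paper's step-wise variant would repeat such masking progressively rather than in one shot, and additionally perturbs the LDCT input with Gaussian noise as a regularizer.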
https://arxiv.org/abs/2601.14180
The paradigm of Large Language Models (LLMs) is currently defined by auto-regressive (AR) architectures, which generate text through a sequential ``brick-by-brick'' process. Despite their success, AR models are inherently constrained by a causal bottleneck that limits global structural foresight and iterative refinement. Diffusion Language Models (DLMs) offer a transformative alternative, conceptualizing text generation as a holistic, bidirectional denoising process akin to a sculptor refining a masterpiece. However, the potential of DLMs remains largely untapped as they are frequently confined within AR-legacy infrastructures and optimization frameworks. In this Perspective, we identify ten fundamental challenges ranging from architectural inertia and gradient sparsity to the limitations of linear reasoning that prevent DLMs from reaching their ``GPT-4 moment''. We propose a strategic roadmap organized into four pillars: foundational infrastructure, algorithmic optimization, cognitive reasoning, and unified multimodal intelligence. By shifting toward a diffusion-native ecosystem characterized by multi-scale tokenization, active remasking, and latent thinking, we can move beyond the constraints of the causal horizon. We argue that this transition is essential for developing next-generation AI capable of complex structural reasoning, dynamic self-correction, and seamless multimodal integration.
https://arxiv.org/abs/2601.14041
Concept erasure aims to suppress sensitive content in diffusion models, but recent studies show that erased concepts can still be reawakened, revealing vulnerabilities in erasure methods. Existing reawakening methods mainly rely on prompt-level optimization to manipulate sampling trajectories, neglecting other generative factors, which limits a comprehensive understanding of the underlying dynamics. In this paper, we model the generation process as an implicit function to enable a comprehensive theoretical analysis of multiple factors, including text conditions, model parameters, and latent states. We theoretically show that perturbing each factor can reawaken erased concepts. Building on this insight, we propose a novel concept reawakening method: Latent space Unblocking for concept REawakening (LURE), which reawakens erased concepts by reconstructing the latent space and guiding the sampling trajectory. Specifically, our semantic re-binding mechanism reconstructs the latent space by aligning denoising predictions with target distributions to reestablish severed text-visual associations. However, in multi-concept scenarios, naive reconstruction can cause gradient conflicts and feature entanglement. To address this, we introduce Gradient Field Orthogonalization, which enforces feature orthogonality to prevent mutual interference. Additionally, our Latent Semantic Identification-Guided Sampling (LSIS) ensures stability of the reawakening process via posterior density verification. Extensive experiments demonstrate that LURE enables simultaneous, high-fidelity reawakening of multiple erased concepts across diverse erasure tasks and methods.
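The gradient-orthogonalization idea maps onto a standard Gram-Schmidt construction, sketched below; this matches the abstract's description of enforcing feature orthogonality across concepts, but is not the paper's code:

```python
import numpy as np

def orthogonalize_gradients(grads):
    """Project each concept's gradient to be orthogonal to those of earlier
    concepts (Gram-Schmidt), so that updating one concept's direction does
    not interfere with the others."""
    out = []
    for g in grads:
        g = g.astype(float).copy()
        for b in out:
            denom = b @ b
            if denom > 0:
                g -= (g @ b) / denom * b  # remove the component along b
        out.append(g)
    return out
```

After this projection, pairwise inner products between the per-concept update directions vanish, which is what prevents the gradient conflicts the abstract describes in multi-concept reconstruction.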
https://arxiv.org/abs/2601.14330
Sparse regularization is fundamental in signal processing and feature extraction but often relies on non-differentiable penalties, conflicting with gradient-based optimizers. We propose WEEP (Weakly-convex Envelope of Piecewise Penalty), a novel differentiable regularizer derived from the weakly-convex envelope framework. WEEP provides tunable, unbiased sparsity and a simple closed-form proximal operator, while maintaining full differentiability and L-smoothness, ensuring compatibility with both gradient-based and proximal algorithms. This resolves the tradeoff between statistical performance and computational tractability. We demonstrate superior performance compared to established convex and non-convex sparse regularizers on challenging compressive sensing and image denoising tasks.
https://arxiv.org/abs/2507.20447
Diffusion models have emerged as state-of-the-art generative methods for image synthesis, yet their potential as general-purpose feature encoders remains underexplored. Trained for denoising and generation without labels, they can be interpreted as self-supervised learners that capture both low- and high-level structure. We show that a frozen diffusion backbone enables strong fine-grained recognition by probing intermediate denoising features across layers and timesteps and training a linear classifier for each pair. We evaluate this in a real-world plankton-monitoring setting with practical impact, using controlled and comparable training setups against established supervised and self-supervised baselines. Frozen diffusion features are competitive with supervised baselines and outperform other self-supervised methods in both balanced and naturally long-tailed settings. Out-of-distribution evaluations on temporally and geographically shifted plankton datasets further show that frozen diffusion features maintain strong accuracy and Macro F1 under substantial distribution shift.
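The linear-probing step can be sketched as below, run once per (layer, timestep) feature pair; feature extraction from the frozen diffusion backbone is assumed done upstream, and the ridge-regression probe is our simple stand-in for whatever linear classifier the paper trains:

```python
import numpy as np

def linear_probe_accuracy(feats_train, y_train, feats_test, y_test, l2=1e-3):
    """Fit a ridge-regression linear probe on frozen features against one-hot
    class targets, then report test accuracy via argmax over class scores."""
    n_cls = int(y_train.max()) + 1
    Y = np.eye(n_cls)[y_train]        # one-hot targets, shape (n, n_cls)
    X = feats_train
    # closed-form ridge solution: (X^T X + l2 I) W = X^T Y
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    pred = (feats_test @ W).argmax(axis=1)
    return float((pred == y_test).mean())
```

Sweeping this over layers and denoising timesteps, and keeping the best-scoring pair, reproduces the probing protocol the abstract describes.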
https://arxiv.org/abs/2601.13416
Tactile sensing provides a promising sensing modality for object pose estimation in manipulation settings where visual information is limited due to occlusion or environmental effects. However, efficiently leveraging tactile data for estimation remains a challenge due to partial observability, with single observations corresponding to multiple possible contact configurations. This limits conventional estimation approaches largely tailored to vision. We propose to address these challenges by learning an inverse tactile sensor model using denoising diffusion. The model is conditioned on tactile observations from a distributed tactile sensor and trained in simulation using a geometric sensor model based on signed distance fields. Contact constraints are enforced during inference through single-step projection using distance and gradient information from the signed distance field. For online pose estimation, we integrate the inverse model with a particle filter through a proposal scheme that combines generated hypotheses with particles from the prior belief. Our approach is validated in simulated and real-world planar pose estimation settings, without access to visual data or tight initial pose priors. We further evaluate robustness to unmodeled contact and sensor dynamics for pose tracking in a box-pushing scenario. Compared to local sampling baselines, the inverse sensor model improves sampling efficiency and estimation accuracy while preserving multimodal beliefs across objects with varying tactile discriminability.
https://arxiv.org/abs/2601.13250