Reinforcement Learning (RL) has emerged as a dominant paradigm for end-to-end autonomous driving (AD). However, RL suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation Models, particularly Vision-Language Models (VLMs), can mitigate this by offering rich, context-aware knowledge, yet their high inference latency hinders deployment in high-frequency RL training loops. To bridge this gap, we present Found-RL, a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is an asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop, effectively resolving latency bottlenecks to support real-time learning. We introduce diverse supervision mechanisms, Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG), to effectively distill expert-like VLM action suggestions into the RL policy. Additionally, we adopt high-throughput CLIP for dense reward shaping. We address CLIP's dynamic blindness via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin-based bonus from context-specific action-anchor scoring. Found-RL provides an end-to-end pipeline for fine-tuned VLM integration and shows that a lightweight RL model can approach the performance of billion-parameter VLMs while sustaining real-time inference (approx. 500 FPS). Code, data, and models will be publicly available at this https URL.
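The decoupling performed by Found-RL's asynchronous batch inference can be sketched as a plain producer/consumer queue. Everything below (function names, the batch size, the stand-in `vlm_batch_infer`) is hypothetical; the sketch only illustrates how a simulation loop can enqueue observations and move on immediately, while a worker thread drains the queue in batches so heavy VLM reasoning never blocks the high-frequency loop.

```python
import queue
import threading

def vlm_batch_infer(batch):
    # Placeholder for heavy VLM reasoning; here it just labels each request.
    return [f"advice-for-{obs}" for obs in batch]

def inference_worker(requests, results, batch_size, stop):
    """Drain the request queue in batches so the simulator never blocks on the VLM."""
    while not stop.is_set() or not requests.empty():
        batch = []
        try:
            batch.append(requests.get(timeout=0.01))
        except queue.Empty:
            continue
        while len(batch) < batch_size and not requests.empty():
            batch.append(requests.get_nowait())
        for obs, out in zip(batch, vlm_batch_infer(batch)):
            results[obs] = out  # cached advice the RL learner can consume later

requests, results, stop = queue.Queue(), {}, threading.Event()
worker = threading.Thread(target=inference_worker, args=(requests, results, 4, stop))
worker.start()
for step in range(10):          # simulation loop: enqueue and continue immediately
    requests.put(f"obs{step}")
stop.set()
worker.join()
```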
https://arxiv.org/abs/2602.10458
Leveraging representation encoders for generative modeling offers a path for efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this to a capacity bottleneck, proposing computationally expensive width scaling of diffusion transformers, we demonstrate that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. RJF enables the standard DiT-B architecture (131M parameters) to converge effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: this https URL
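The geometric point, that probability paths should follow hypersphere geodesics rather than Euclidean chords through the low-density interior, can be illustrated with spherical linear interpolation. This toy sketch is not the paper's method (it omits the flow-matching loss and the Jacobi regularization entirely); it only contrasts the geodesic path, which stays on the manifold surface, with the straight-line path, which dips inside.

```python
import numpy as np

def slerp(x0, x1, t):
    """Geodesic between unit vectors x0, x1 on the hypersphere (the path a
    Euclidean interpolation would instead cut through the low-density interior)."""
    omega = np.arccos(np.clip(x0 @ x1, -1.0, 1.0))   # angle between endpoints
    return (np.sin((1 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

rng = np.random.default_rng(0)
x0 = rng.normal(size=8); x0 /= np.linalg.norm(x0)    # noise sample on the sphere
x1 = rng.normal(size=8); x1 /= np.linalg.norm(x1)    # "feature" sample on the sphere

# Every point on the geodesic stays on the manifold surface ...
on_sphere = all(abs(np.linalg.norm(slerp(x0, x1, t)) - 1) < 1e-9
                for t in np.linspace(0, 1, 11))
# ... whereas the Euclidean midpoint has norm cos(omega/2) < 1, i.e. it lies
# strictly inside the sphere.
midpoint_norm = np.linalg.norm(0.5 * x0 + 0.5 * x1)
```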
https://arxiv.org/abs/2602.10099
Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
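A minimal numpy sketch of the discrete-agreement idea described above: the teacher emits multi-label binary messages, the student is trained with element-wise binary cross-entropy, and a channel-utilization penalty stands in for the coding-rate term. The 0.5-usage penalty is an assumed simplification, not the paper's exact regularizer, and the logits here are random stand-ins for network outputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
teacher_logits = rng.normal(size=(4, 16))          # 4 views, 16-bit channel
student_logits = teacher_logits + 0.1 * rng.normal(size=(4, 16))

messages = (teacher_logits > 0).astype(float)      # teacher's binary messages
p = sigmoid(student_logits)                        # student's per-bit predictions

# Element-wise binary cross-entropy: the student must reproduce each bit.
bce = -(messages * np.log(p) + (1 - messages) * np.log(1 - p)).mean()

# Assumed stand-in for the coding-rate term: push the batch-average activation
# of every bit toward 0.5 so the constrained channel is fully utilized.
usage = p.mean(axis=0)
utilization_penalty = ((usage - 0.5) ** 2).mean()

loss = bce + 0.1 * utilization_penalty
```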
https://arxiv.org/abs/2602.09764
While Vision-Language-Action (VLA) models show strong promise for generalist robot control, it remains unclear whether -- and under what conditions -- the standard "scale data" recipe translates to robotics, where training data is inherently heterogeneous across embodiments, sensors, and action spaces. We present a systematic, controlled study of VLA scaling that revisits core training choices for pretraining across diverse robots. Using a representative VLA framework that combines a vision-language backbone with flow-matching, we ablate key design decisions under matched conditions and evaluate in extensive simulation and real-robot experiments. To improve the reliability of real-world results, we introduce a Grouped Blind Ensemble protocol that blinds operators to model identity and separates policy execution from outcome judgment, reducing experimenter bias. Our analysis targets three dimensions of VLA scaling. (1) Physical alignment: we show that a unified end-effector (EEF)-relative action representation is critical for robust cross-embodiment transfer. (2) Embodiment mixture: we find that naively pooling heterogeneous robot datasets often induces negative transfer rather than gains, underscoring the fragility of indiscriminate data scaling. (3) Training regularization: we observe that intuitive strategies, such as sensory dropout and multi-stage fine-tuning, do not consistently improve performance at scale. Together, this study challenges some common assumptions about embodied scaling and provides practical guidance for training large-scale VLA policies from diverse robotic data. Project website: this https URL
https://arxiv.org/abs/2602.09722
Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment. We present SOFT-FLOW, a sample-efficient off-policy fine-tuning framework with normalizing flow (NF) to address these challenges. The normalizing flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware. We evaluate SOFT-FLOW on two challenging dexterous manipulation tasks in the real world: cutting tape with scissors retrieved from a case, and in-hand cube rotation with a palm-down grasp -- both of which require precise, dexterous control over long horizons. On these tasks, SOFT-FLOW achieves stable, sample-efficient adaptation where standard methods struggle.
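The exact-likelihood property that normalizing flows provide (and that diffusion policies lack) is easiest to see for a single elementwise affine flow over an action chunk: the change-of-variables formula gives a closed-form log-density, which a conservative fine-tuning objective can then regularize. The flow below is an assumed toy, not SOFT-FLOW's architecture.

```python
import numpy as np

def flow_logprob(action_chunk, shift, log_scale):
    """Exact log-density of an action chunk under an elementwise affine flow:
    a = shift + exp(log_scale) * z, with z ~ N(0, I). Change of variables gives
    log p(a) = log N(z; 0, I) + log|det J^{-1}| = log N(z; 0, I) - sum(log_scale)."""
    z = (action_chunk - shift) * np.exp(-log_scale)            # inverse flow
    log_base = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum()       # standard normal
    return log_base - log_scale.sum()

chunk = np.array([0.2, -0.1, 0.4, 0.0])        # one 4-step action chunk
shift = np.zeros(4)
log_scale = np.log(0.5) * np.ones(4)

lp = flow_logprob(chunk, shift, log_scale)
# A conservative update can penalize likelihood drops on reference actions,
# e.g. loss += -beta * lp; this quantity is intractable for diffusion policies.
```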
https://arxiv.org/abs/2602.09580
Semi-supervised learning relaxes the need of large pixel-wise labeled datasets for image segmentation by leveraging unlabeled data. The scarcity of high-quality labeled data remains a major challenge in medical image analysis due to the high annotation costs and the need for specialized clinical expertise. Semi-supervised learning has demonstrated significant potential in addressing this bottleneck, with pseudo-labeling and consistency regularization emerging as two predominant paradigms. Dual-task collaborative learning, an emerging consistency-aware paradigm, seeks to derive supplementary supervision by establishing prediction consistency between related tasks. However, current methodologies are limited to unidirectional interaction mechanisms (typically regression-to-segmentation), as segmentation results can only be transformed into regression outputs in an offline manner, thereby failing to fully exploit the potential benefits of online bidirectional cross-task collaboration. Thus, we propose a fully Differentiable Bidirectional Synergistic Learning (DBiSL) framework, which seamlessly integrates and enhances four critical SSL components: supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation. Experiments on two benchmark datasets demonstrate our method's state-of-the-art performance. Beyond technical contributions, this work provides new insights into unified SSL framework design and establishes a new architectural foundation for dual-task-driven SSL, while offering a generic multitask learning framework applicable to broader computer vision applications. The code will be released on github upon acceptance.
https://arxiv.org/abs/2602.09378
High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises serious copyright and ethical concerns. Medical imaging poses a unique challenge for existing dataset ownership verification methods designed for natural images: static watermark patterns generated at a fixed image scale transfer poorly to dynamically scaled, high-resolution scans with limited visual diversity and subtle anatomical structures, and the watermark must additionally preserve diagnostic quality. In this paper, we propose X-Mark, a sample-specific clean-label watermarking method for chest x-ray copyright protection. Specifically, X-Mark uses a conditional U-Net to generate unique perturbations within salient regions of each sample. We design a multi-component training objective to ensure watermark efficacy and robustness against dynamic scaling processes while preserving diagnostic quality and visual distinguishability. We incorporate Laplacian regularization into our training objective to penalize high-frequency perturbations and achieve watermark scale-invariance. Ownership verification is performed in a black-box setting to detect characteristic behaviors in suspicious models. Extensive experiments on CheXpert verify the effectiveness of X-Mark, achieving a watermark success rate (WSR) of 100% and reducing the probability of false positives in the Ind-M scenario by 12%, while demonstrating resistance to potential adaptive attacks.
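The role of the Laplacian term, penalizing high-frequency perturbations so the watermark survives rescaling, can be sketched with a discrete Laplacian energy. The 5-point kernel, image sizes, and example perturbations below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def laplacian_energy(delta):
    """Mean squared 5-point discrete Laplacian of a perturbation: large for
    high-frequency (pixel-to-pixel) patterns, small for smooth ones, which is
    what encourages scale-robust watermarks."""
    lap = (np.roll(delta, 1, 0) + np.roll(delta, -1, 0)
           + np.roll(delta, 1, 1) + np.roll(delta, -1, 1) - 4 * delta)
    return (lap ** 2).mean()

rng = np.random.default_rng(0)
noisy = rng.normal(size=(32, 32))                    # high-frequency perturbation
smooth = np.outer(np.sin(np.linspace(0, np.pi, 32)),
                  np.sin(np.linspace(0, np.pi, 32))) # smooth, low-frequency one

high = laplacian_energy(noisy)    # heavily penalized
low = laplacian_energy(smooth)    # barely penalized
```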
https://arxiv.org/abs/2602.09284
Accurate simulation of turbulent flows is fundamental to scientific and engineering applications. Direct numerical simulation (DNS) offers the highest fidelity but is computationally prohibitive, while existing data-driven alternatives struggle with stable long-horizon rollouts, physical consistency, and faithful simulation of small-scale structures. These challenges are particularly acute in three-dimensional (3D) settings, where the cubic growth of spatial degrees of freedom dramatically amplifies computational cost, memory demand, and the difficulty of capturing multi-scale interactions. To address these challenges, we propose a Physics-Enhanced Swin Transformer (PEST) for 3D turbulence simulation. PEST leverages a window-based self-attention mechanism to effectively model localized PDE interactions while maintaining computational efficiency. We introduce a frequency-domain adaptive loss that explicitly emphasizes small-scale structures, enabling more faithful simulation of high-frequency dynamics. To improve physical consistency, we incorporate Navier--Stokes residual constraints and divergence-free regularization directly into the learning objective. Extensive experiments on two representative turbulent flow configurations demonstrate that PEST achieves accurate, physically consistent, and stable autoregressive long-term simulations, outperforming existing data-driven baselines.
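A minimal stand-in for the frequency-domain adaptive loss: weight spectral error by wavenumber so small-scale (high-frequency) mistakes cost more than equal-energy large-scale ones. The linear weight schedule below is an assumption for illustration, not PEST's exact adaptive form, and the 1-D signals stand in for 3D flow fields.

```python
import numpy as np

def spectral_weighted_loss(pred, target):
    """Frequency-domain loss that up-weights high-wavenumber error, emphasizing
    small-scale structures (assumed linear weight schedule)."""
    err = np.fft.rfft(pred - target)
    k = np.arange(err.size)                  # wavenumber index
    weights = 1.0 + k / k[-1]                # higher k -> larger weight
    return (weights * np.abs(err) ** 2).sum() / err.size

x = np.linspace(0, 2 * np.pi, 64, endpoint=False)
target = np.sin(x) + 0.1 * np.sin(8 * x)
pred_coarse = np.sin(x)                                # misses the small-scale mode
pred_shifted = 1.1 * np.sin(x) + 0.1 * np.sin(8 * x)   # errs on the large scale

# Both predictions have identical plain MSE, but the spectral loss penalizes
# the one that drops the small-scale structure more.
loss_coarse = spectral_weighted_loss(pred_coarse, target)
loss_shift = spectral_weighted_loss(pred_shifted, target)
```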
https://arxiv.org/abs/2602.10150
Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time for both single and multi-objective settings. With more in-context demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.
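The conjugate arithmetic underlying the objective can be shown directly: under a Beta prior, binary in-context preference demonstrations update the latent Bradley-Terry preference probability in closed form, and the posterior mean acts as an adapted reward estimate. ICRM amortizes this inference with a network; the sketch below is only the exact Beta-Bernoulli update it builds on, with all names and values illustrative.

```python
def icrm_style_posterior(prior_a, prior_b, demos):
    """Conjugate Beta update over a latent Bradley-Terry preference probability.
    `demos` is a list of 1 (response A preferred) / 0 (B preferred) in-context
    demonstrations; returns the posterior mean of Beta(a, b)."""
    wins = sum(demos)
    a, b = prior_a + wins, prior_b + len(demos) - wins
    return a / (a + b)

# With no demonstrations the estimate falls back to the prior mean.
r0 = icrm_style_posterior(1.0, 1.0, [])
# In-context demonstrations steer the estimate toward the demonstrated preference.
r8 = icrm_style_posterior(1.0, 1.0, [1, 1, 1, 1, 1, 1, 1, 0])
```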
https://arxiv.org/abs/2602.08819
Time series forecasting models often lack interpretability, limiting their adoption in domains requiring explainable predictions. We propose \textsc{FreqLens}, an interpretable forecasting framework that discovers and attributes predictions to learnable frequency components. \textsc{FreqLens} introduces two key innovations: (1) \emph{learnable frequency discovery} -- frequency bases are parameterized via sigmoid mapping and learned from data with diversity regularization, enabling automatic discovery of dominant periodic patterns without domain knowledge; and (2) \emph{axiomatic frequency attribution} -- a theoretically grounded framework that provably satisfies Completeness, Faithfulness, Null-Frequency, and Symmetry axioms, with per-frequency attributions equivalent to Shapley values. On Traffic and Weather datasets, \textsc{FreqLens} achieves competitive or superior performance while discovering physically meaningful frequencies: all 5 independent runs discover the 24-hour daily cycle ($24.6 \pm 0.1$h, 2.5\% error) and 12-hour half-daily cycle ($11.8 \pm 0.1$h, 1.6\% error) on Traffic, and weekly cycles ($10\times$ longer than the input window) on Weather. These results demonstrate genuine frequency-level knowledge discovery with formal theoretical guarantees on attribution quality.
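The sigmoid parameterization that makes frequencies learnable can be sketched as follows: an unconstrained parameter is mapped to a bounded period, from which sin/cos features are built, so gradient descent can move the frequency itself. The period bounds and feature construction below are assumptions for illustration, not \textsc{FreqLens}'s exact basis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def frequency_basis(theta, t, period_min=2.0, period_max=48.0):
    """Map an unconstrained parameter to a period in [period_min, period_max]
    hours via a sigmoid (assumed bounds), then build sin/cos features. Because
    the mapping is differentiable, the dominant period can be learned from data."""
    period = period_min + (period_max - period_min) * sigmoid(theta)
    phase = 2 * np.pi * t / period
    return period, np.stack([np.sin(phase), np.cos(phase)])

t = np.arange(0.0, 96.0)                 # four days of hourly steps
period, feats = frequency_basis(0.0, t)  # theta = 0 -> midpoint period of 25h
```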
https://arxiv.org/abs/2602.08768
Inverse rendering aims to decompose a scene into its geometry, material properties and light conditions under a certain rendering model. It has wide applications like view synthesis, relighting, and scene editing. In recent years, inverse rendering methods have been inspired by view synthesis approaches like neural radiance fields and Gaussian splatting, which are capable of efficiently decomposing a scene into its geometry and radiance. They then further estimate the material and lighting that lead to the observed scene radiance. However, the latter step is highly ambiguous, and prior works suffer from inaccurate color and baked-in shadows in their albedo estimation despite their regularization. To this end, we propose RotLight, a simple capturing setup, to address the ambiguity. Compared to a usual capture, RotLight only requires the object to be rotated several times during the process. We show that as few as two rotations are effective in reducing artifacts. To further improve 2DGS-based inverse rendering, we additionally introduce a proxy mesh that not only allows accurate incident light tracing, but also enables a residual constraint and improves global illumination handling. We demonstrate with both synthetic and real world datasets that our method achieves superior albedo estimation while keeping efficient computation.
https://arxiv.org/abs/2602.08724
Image reconstruction in X-ray tomography is an ill-posed inverse problem, particularly with limited available data. Regularization is thus essential, but its effectiveness hinges on the choice of a regularization parameter that balances data fidelity against a priori information. We present a novel method for automatic parameter selection based on the use of two distinct computational discretizations of the same problem. A feedback control algorithm dynamically adjusts the regularization strength, driving an iterative reconstruction toward the smallest parameter that yields sufficient similarity between reconstructions on the two grids. The effectiveness of the proposed approach is demonstrated using real tomographic data.
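The feedback rule described above, increase the regularization strength until reconstructions on the two discretizations agree and keep the smallest accepted value, can be sketched generically. The toy `agreement` curve below is an assumed, monotone stand-in for the actual cross-grid reconstruction similarity; in the paper this similarity comes from comparing iterative tomographic reconstructions on two grids.

```python
def smallest_accepted_lam(agreement, lam0=1e-4, tol=0.95, growth=2.0):
    """Grow the regularization parameter geometrically until the cross-grid
    agreement reaches the tolerance, returning the smallest accepted value
    (assumes agreement is monotone increasing in lam)."""
    lam = lam0
    while agreement(lam) < tol:
        lam *= growth
        if lam > 1e8:
            raise RuntimeError("no lambda reached the agreement tolerance")
    return lam

def toy_agreement(lam):
    # Assumed stand-in: similarity of the two reconstructions rises with lam.
    return lam / (lam + 0.01)

lam_star = smallest_accepted_lam(toy_agreement)
```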
https://arxiv.org/abs/2602.08528
Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), to relieve audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
https://arxiv.org/abs/2602.08309
Understanding simplicity biases in deep learning offers a promising path toward developing reliable AI. A common metric for this, inspired by Boolean function analysis, is average sensitivity, which captures a model's robustness to single-token perturbations. We argue that average sensitivity has two key limitations: it lacks a natural generalization to real-valued domains and fails to explain the "junta-like" input dependence we empirically observe in modern LLMs. To address these limitations, we propose noise stability as a more comprehensive simplicity metric. Noise stability expresses a model's robustness to correlated noise applied to all input coordinates simultaneously. We provide a theoretical analysis of noise stability for single-layer attention and ReLU MLP layers and tackle the multi-layer propagation problem with a covariance interval propagation approach. Building on this theory, we develop a practical noise stability regularization method. Experiments on algorithmic and next-token-prediction tasks show that our regularizer consistently catalyzes grokking and accelerates training by approximately $35\%$ and $75\%$ respectively. Our results sculpt a new connection between signal propagation in neural networks and interpretability, with noise stability emerging as a powerful tool for understanding and improving modern Transformers.
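Noise stability E[f(x) f(x_rho)], with x_rho a rho-correlated Gaussian copy of x, is straightforward to estimate by Monte Carlo. The toy functions below are illustrative, not the paper's networks; they show how the metric separates "junta-like" functions that depend on few coordinates from ones that mix all coordinates.

```python
import numpy as np

def noise_stability(f, dim, rho, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[f(x) f(x_rho)] where x ~ N(0, I) and
    x_rho = rho * x + sqrt(1 - rho^2) * z is a rho-correlated copy. Correlated
    noise hits every input coordinate at once, unlike single-token sensitivity."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, dim))
    z = rng.normal(size=(n_samples, dim))
    x_rho = rho * x + np.sqrt(1 - rho ** 2) * z
    return (f(x) * f(x_rho)).mean()

def simple(X):
    return np.sign(X[:, 0])              # a 1-junta: depends on one coordinate

def parity(X):
    return np.prod(np.sign(X), axis=1)   # mixes every coordinate

s_simple = noise_stability(simple, dim=8, rho=0.9)   # ~ (2/pi) arcsin(0.9)
s_parity = noise_stability(parity, dim=8, rho=0.9)   # ~ ((2/pi) arcsin(0.9))^8
```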
https://arxiv.org/abs/2602.08287
Parameter-Efficient Fine-Tuning (PEFT) is a popular class of techniques that strive to adapt large models in a scalable and resource-efficient manner. Yet, the mechanisms underlying their training performance and generalization remain underexplored. In this paper, we provide several insights into such fine-tuning through the lens of linearization. Fine-tuned models are often implicitly encouraged to remain close to the pretrained model. By making this explicit, using an Euclidean distance inductive bias in parameter space, we show that fine-tuning dynamics become equivalent to learning with the positive-definite neural tangent kernel (NTK). We specifically analyze how close the fully linear and the linearized fine-tuning optimizations are, based on the strength of the regularization. This allows us to be pragmatic about how good a model linearization is when fine-tuning large language models (LLMs). When linearization is a good model, our findings reveal a strong correlation between the eigenvalue spectrum of the NTK and the performance of model adaptation. Motivated by this, we give spectral perturbation bounds on the NTK induced by the choice of layers selected for fine-tuning. We empirically validate our theory on Low Rank Adaptation (LoRA) on LLMs. These insights not only characterize fine-tuning but also have the potential to enhance PEFT techniques, paving the way to better informed and more nimble adaptation in LLMs.
https://arxiv.org/abs/2602.08239
Text-guided diffusion models are used by millions of users, but can be easily exploited to produce harmful content. Concept unlearning methods aim at reducing the models' likelihood of generating harmful content. Traditionally, this has been tackled at an individual concept level, with only a handful of recent works considering more realistic concept combinations. However, state-of-the-art methods depend on full finetuning, which is computationally expensive. Concept localisation methods can facilitate selective finetuning, but existing techniques are static, resulting in suboptimal utility. In order to tackle these challenges, we propose TRUST (Targeted Robust Selective fine Tuning), a novel approach for dynamically estimating target concept neurons and unlearning them through selective finetuning, empowered by a Hessian-based regularization. We show experimentally, against a number of SOTA baselines, that TRUST is robust against adversarial prompts, preserves generation quality to a significant degree, and is also significantly faster than the SOTA. Our method achieves unlearning of not only individual concepts but also combinations of concepts and conditional concepts, without any specific regularization.
https://arxiv.org/abs/2602.07919
Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) Informative training trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a regularization strategy using anchor data. Extensive experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks (7Scenes, TUM-RGBD, Matterport3D) compared to state-of-the-art baselines. To our knowledge, SAGE pioneers the adaptation of geometric foundation models via Internet video, establishing a scalable paradigm for general-purpose 3D learning.
https://arxiv.org/abs/2602.07891
Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of next-token prediction stages (e.g., pretraining and supervised fine-tuning), despite fundamental differences between RL and these stages highlighted by recent work. One such practice is the use of the AdamW optimizer, which is widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both momentum and adaptive learning rates in AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits less from Adam-style per-parameter adaptive learning rates and momentum. Confirming this hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, which is known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model parameters without any sparsity-promoting regularization, more than 1000 times fewer than AdamW. Our analysis offers potential reasons for this update sparsity. These findings provide new insights into the optimization dynamics of RL in LLMs and show that RL can be substantially more parameter-efficient than previously recognized.
https://arxiv.org/abs/2602.07729
Weight-only post-training quantization (PTQ) is crucial for efficient Large Language Model (LLM) deployment but suffers from accuracy degradation caused by weight and activation outliers. Existing mitigation strategies often face critical limitations: they either yield insufficient outlier suppression or incur significant deployment inefficiencies, such as inference latency, heavy preprocessing, or reliance on complex operator fusion. To resolve these limitations, we leverage a key insight: over-parameterized LLMs often converge to Flat Minima, implying a vast equivalent solution space where weights can be adjusted without compromising accuracy. Building on this, we propose Astro, an Activation-guided Structured Regularization framework designed to suppress the negative effects of outliers in a hardware-friendly and efficient manner. Leveraging the activation-guided regularization objective, Astro actively reconstructs intrinsically robust weights, aggressively suppressing weight outliers corresponding to high-magnitude activations without sacrificing model accuracy. Crucially, Astro introduces zero inference latency and is orthogonal to mainstream quantization methods like GPTQ. Extensive experiments show that Astro achieves highly competitive performance; notably, on LLaMA-2-7B, it achieves better performance than complex learning-based rotation methods with almost 1/3 of the quantization time.
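The flat-minima insight can be phrased as: move W within the near-equivalent solution set toward a point whose outlier weights on high-activation input channels are small. A deliberately simplified objective in that spirit — the form below and its weighting are assumptions for illustration, not Astro's actual formulation:

```python
import numpy as np

def activation_guided_objective(W, W0, act_scale, lam=0.1):
    # Fidelity term: weight changes on high-activation input channels
    # perturb the layer's outputs most, so they are penalized more.
    fidelity = np.sum(((W - W0) * act_scale) ** 2)
    # Outlier term: shrink weight magnitudes where activations are
    # large, i.e. exactly where quantization error gets amplified.
    outlier = np.sum(np.abs(W) * act_scale)
    return fidelity + lam * outlier
```

Minimizing such an objective trades a small, activation-weighted deviation from the pretrained weights for a weight distribution that is easier to quantize, and it happens entirely offline, which is why no inference-time overhead is incurred.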
https://arxiv.org/abs/2602.07596
Diffusion Large Language Models (DLLMs) are inherently ill-suited for variable-length generation, as their inference is defined on a fixed-length canvas and implicitly assumes a known target length. When the length is unknown, as in realistic completion and infilling, naively comparing confidence across mask lengths becomes systematically biased, leading to under-generation or redundant continuations. In this paper, we show that this failure arises from an intrinsic length-induced bias in generation confidence estimates, leaving existing DLLMs without a robust way to determine generation length and making variable-length inference unreliable. To address this issue, we propose LR-DLLM, a length-regularized inference framework for DLLMs that treats generation length as an explicit variable and achieves reliable length determination at inference time. It decouples semantic compatibility from length-induced uncertainty through an explicit length regularization that corrects biased confidence estimates. Based on this, LR-DLLM enables dynamic expansion or contraction of the generation span without modifying the underlying DLLM or its training procedure. Experiments show that LR-DLLM achieves 51.3% Pass@1 on HumanEval-Infilling under fully unknown lengths (+13.4% vs. DreamOn) and 51.5% average Pass@1 on four-language McEval (+14.3% vs. DreamOn).
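The bias can be made concrete: a length-L candidate's confidence is not directly comparable across different L, so the score must be corrected with an explicit length term before choosing a span. A minimal sketch of such a length-regularized selection rule — the alpha*log(L) penalty is an illustrative assumption, not the paper's estimator:

```python
import numpy as np

def pick_length(logprobs_by_len, alpha=0.1):
    # For each candidate mask length L, score the infill by per-token
    # confidence minus an explicit length-regularization term, then
    # keep the length with the best corrected score.
    best_len, best = None, -np.inf
    for L, lps in logprobs_by_len.items():
        score = float(np.mean(lps)) - alpha * np.log(L)
        if score > best:
            best_len, best = L, score
    return best_len
```

Note that with alpha = 0 this reduces to the naive biased comparison; a nonzero correction can flip the chosen length, which is the point of treating length as an explicit variable.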
https://arxiv.org/abs/2602.07546