2D Gaussian Splatting (2DGS) is an emerging explicit scene representation with significant potential for image compression due to its high fidelity and high compression ratios. However, existing low-light enhancement algorithms operate predominantly in the pixel domain, so processing 2DGS-compressed images requires a cumbersome decompression-enhancement-recompression pipeline that compromises efficiency and introduces secondary degradation. To address these limitations, we propose LL-GaussianImage, the first zero-shot unsupervised framework for low-light enhancement performed directly in the 2DGS compressed representation domain. The framework offers three primary advantages. First, it introduces a semantic-guided Mixture-of-Experts enhancement framework that applies dynamic adaptive transformations to the sparse attribute space of 2DGS, using rendered images as guidance, to enable compression-as-enhancement without full decompression to a pixel grid. Second, it establishes a multi-objective collaborative loss function system that strictly constrains smoothness and fidelity during enhancement, suppressing artifacts while improving visual quality. Third, it employs a two-stage optimization process to achieve reconstruction-as-enhancement: single-scale reconstruction ensures the accuracy of the base representation and strengthens network robustness. Together, these components achieve high-quality enhancement of low-light images while maintaining high compression ratios. Experimental results validate the feasibility and superiority of this paradigm of direct processing in the compressed representation domain.
https://arxiv.org/abs/2601.15772
Deploying reinforcement learning in the real world remains challenging due to sample inefficiency, sparse rewards, and noisy visual observations. Prior work leverages demonstrations and human feedback to improve learning efficiency and robustness. However, offline-to-online methods need large datasets and can be unstable, while VLA-assisted RL relies on large-scale pretraining and fine-tuning. As a result, a low-cost real-world RL method with minimal data requirements has yet to emerge. We introduce \textbf{SigEnt-SAC}, an off-policy actor-critic method that learns from scratch using a single expert trajectory. Our key design is a sigmoid-bounded entropy term that prevents negative-entropy-driven optimization toward out-of-distribution actions and reduces Q-function oscillations. We benchmark SigEnt-SAC on D4RL tasks against representative baselines. Experiments show that SigEnt-SAC substantially alleviates Q-function oscillations and reaches a 100\% success rate faster than prior methods. Finally, we validate SigEnt-SAC on four real-world robotic tasks across multiple embodiments, where agents learn from raw images and sparse rewards; results demonstrate that SigEnt-SAC can learn successful policies with only a small number of real-world interactions, suggesting a low-cost and practical pathway for real-world RL deployment.
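The abstract does not give the exact functional form of the sigmoid-bounded entropy term, so the sketch below is only one plausible reading: the hypothetical bonus saturates through a sigmoid so that extreme log-probabilities cannot dominate the actor objective the way the unbounded SAC term can. Function names and the choice of `alpha` are assumptions, not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sac_entropy_bonus(log_prob, alpha=0.2):
    # Standard SAC actor bonus: -alpha * log pi(a|s). For squashed
    # Gaussian policies the log-probability of an action can become
    # arbitrarily negative, so this term is unbounded and can drive
    # the policy toward out-of-distribution actions.
    return -alpha * log_prob

def sigent_entropy_bonus(log_prob, alpha=0.2):
    # Hypothetical SigEnt-style bonus (exact form not stated in the
    # abstract): squashing through a sigmoid bounds the bonus to
    # (0, alpha), so extreme log-probabilities saturate instead of
    # dominating the actor loss.
    return alpha * sigmoid(-log_prob)
```

For a very unlikely action (`log_prob = -50`), the standard bonus is 10.0 while the bounded variant saturates near `alpha = 0.2`, which is consistent with the claim that the design damps negative-entropy-driven optimization.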
https://arxiv.org/abs/2601.15761
3D occupancy prediction plays a pivotal role in the realm of autonomous driving, as it provides a comprehensive understanding of the driving environment. Most existing methods construct dense scene representations for occupancy prediction, overlooking the inherent sparsity of real-world driving scenes. Recently, 3D superquadric representation has emerged as a promising sparse alternative to dense scene representations due to the strong geometric expressiveness of superquadrics. However, existing superquadric frameworks still suffer from insufficient temporal modeling, a challenging trade-off between query sparsity and geometric expressiveness, and inefficient superquadric-to-voxel splatting. To address these issues, we propose SuperOcc, a novel framework for superquadric-based 3D occupancy prediction. SuperOcc incorporates three key designs: (1) a cohesive temporal modeling mechanism to simultaneously exploit view-centric and object-centric temporal cues; (2) a multi-superquadric decoding strategy to enhance geometric expressiveness without sacrificing query sparsity; and (3) an efficient superquadric-to-voxel splatting scheme to improve computational efficiency. Extensive experiments on the SurroundOcc and Occ3D benchmarks demonstrate that SuperOcc achieves state-of-the-art performance while maintaining superior efficiency. The code is available at this https URL.
https://arxiv.org/abs/2601.15644
Large language models (LLMs) can call tools effectively, yet they remain brittle in multi-turn execution: following a tool call error, smaller models often degenerate into repetitive invalid re-invocations, failing to interpret error feedback and self-correct. This brittleness hinders reliable real-world deployment, where execution errors are inherently inevitable during tool interaction. We identify a key limitation of current approaches: standard reinforcement learning (RL) treats errors as sparse negative rewards, providing no guidance on how to recover, while pre-collected synthetic error-correction datasets suffer from distribution mismatch with the model's on-policy error modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a finetuned Error Simulator, then resampling recovery rollouts on-policy. This enables the model to learn from the precise errors it makes during exploration, rather than from static, pre-collected error cases. On the BFCL v4 Multi-Turn benchmark, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7% absolute and, crucially, yields a 4% overall accuracy gain (42.75% to 46.75%) over GRPO, outperforming specialized tool-use agents.
https://arxiv.org/abs/2601.15625
We develop a two-stage retrieval system that combines multiple complementary retrieval methods with a learned reranker and LLM-based reranking, to address the TREC Tip-of-the-Tongue (ToT) task. In the first stage, we employ hybrid retrieval that merges LLM-based retrieval, sparse (BM25), and dense (BGE-M3) retrieval methods. We also introduce topic-aware multi-index dense retrieval that partitions the Wikipedia corpus into 24 topical domains. In the second stage, we evaluate both a trained LambdaMART reranker and LLM-based reranking. To support model training, we generate 5000 synthetic ToT queries using LLMs. Our best system achieves recall of 0.66 and NDCG@1000 of 0.41 on the test set by combining hybrid retrieval with Gemini-2.5-flash reranking, demonstrating the effectiveness of fusion retrieval.
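The abstract says the LLM-based, sparse (BM25), and dense (BGE-M3) rankings are merged but does not say how. Reciprocal rank fusion (RRF) is one common fusion scheme and serves here only as an illustrative sketch; the function name and the constant `k=60` are assumptions, not details from the paper.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one fused ranking.

    `rankings` is a list of ranked lists (best first), e.g. one each
    from BM25, dense, and LLM-based retrieval. Each document earns
    1 / (k + rank) from every list it appears in, so documents ranked
    highly by multiple retrievers float to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a document ranked first by two of three retrievers outranks one that appears only once: `reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d1"], ["d2", "d4"]])` places `d2` first.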
https://arxiv.org/abs/2601.15518
Internal activations of diffusion models encode rich semantic information, but interpreting such representations remains challenging. While Sparse Autoencoders (SAEs) have shown promise in disentangling latent representations, existing SAE-based methods for diffusion model understanding rely on unsupervised approaches that fail to align sparse features with human-understandable concepts. This limits their ability to provide reliable semantic control over generated images. We introduce CASL (Concept-Aligned Sparse Latents), a supervised framework that aligns sparse latent dimensions of diffusion models with semantic concepts. CASL first trains an SAE on frozen U-Net activations to obtain disentangled latent representations, and then learns a lightweight linear mapping that associates each concept with a small set of relevant latent dimensions. To validate the semantic meaning of these aligned directions, we propose CASL-Steer, a controlled latent intervention that shifts activations along the learned concept axis. Unlike editing methods, CASL-Steer is used solely as a causal probe to reveal how concept-aligned latents influence generated content. We further introduce the Editing Precision Ratio (EPR), a metric that jointly measures concept specificity and the preservation of unrelated attributes. Experiments show that our method achieves superior editing precision and interpretability compared to existing approaches. To the best of our knowledge, this is the first work to achieve supervised alignment between latent representations and semantic concepts in diffusion models.
https://arxiv.org/abs/2601.15441
Sparse-view Cone-Beam Computed Tomography reconstruction from limited X-ray projections remains a challenging problem in medical imaging due to the inherent undersampling of fine-grained anatomical details, which correspond to high-frequency components. Conventional CNN-based methods often struggle to recover these fine structures, as they are typically biased toward learning low-frequency information. To address this challenge, this paper presents DuFal (Dual-Frequency-Aware Learning), a novel framework that integrates frequency-domain and spatial-domain processing via a dual-path architecture. The core innovation lies in our High-Local Factorized Fourier Neural Operator, which comprises two complementary branches: a Global High-Frequency Enhanced Fourier Neural Operator that captures global frequency patterns and a Local High-Frequency Enhanced Fourier Neural Operator that processes spatially partitioned patches to preserve spatial locality that might be lost in global frequency analysis. To improve efficiency, we design a Spectral-Channel Factorization scheme that reduces the Fourier Neural Operator parameter count. We also design a Cross-Attention Frequency Fusion module to integrate spatial and frequency features effectively. The fused features are then decoded through a Feature Decoder to produce projection representations, which are subsequently processed through an Intensity Field Decoding pipeline to reconstruct a final Computed Tomography volume. Experimental results on the LUNA16 and ToothFairy datasets demonstrate that DuFal significantly outperforms existing state-of-the-art methods in preserving high-frequency anatomical features, particularly under extremely sparse-view settings.
https://arxiv.org/abs/2601.15416
We study the problem of estimating causal effects under hidden confounding in the following unpaired data setting: we observe some covariates $X$ and an outcome $Y$ under different experimental conditions (environments) but do not observe them jointly; we either observe $X$ or $Y$. Under appropriate regularity conditions, the problem can be cast as an instrumental variable (IV) regression with the environment acting as a (possibly high-dimensional) instrument. When there are many environments but only a few observations per environment, standard two-sample IV estimators fail to be consistent. We propose a GMM-type estimator based on cross-fold sample splitting of the instrument-covariate sample and prove that it is consistent as the number of environments grows but the sample size per environment remains constant. We further extend the method to sparse causal effects via $\ell_1$-regularized estimation and post-selection refitting.
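As a toy illustration of why cross-fold splitting of the instrument-covariate sample matters (this is a simplified simulation, not the paper's estimator or data-generating process), the naive two-sample estimator below is attenuated by errors-in-variables bias in the per-environment means of $X$, while the cross-fold denominator multiplies two independent fold means, so the noise-variance bias term drops out and the estimate stays near the true effect as the number of environments grows.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, E, n = 2.0, 2000, 4        # true effect, number of environments, obs per fold

mu = rng.normal(size=E)          # per-environment mean shift of X (the instrument)

def draw_x(mu_e, size):
    # Observed X sample: X = mu_e + H + noise, with hidden confounder H
    h = rng.normal(size=size)
    return mu_e + h + rng.normal(size=size)

def draw_y(mu_e, size):
    # Unpaired Y sample: Y = beta * X + H + noise, where this group's
    # own X and H realizations are never observed
    h = rng.normal(size=size)
    x = mu_e + h + rng.normal(size=size)
    return beta * x + h + rng.normal(size=size)

x1 = np.array([draw_x(m, n).mean() for m in mu])    # fold-1 mean of X per env
x2 = np.array([draw_x(m, n).mean() for m in mu])    # fold-2 mean (independent noise)
ybar = np.array([draw_y(m, n).mean() for m in mu])  # mean of Y per env

# Naive two-sample estimator: biased because E[x1**2] = mu**2 + Var(X)/n
beta_naive = (x1 * ybar).sum() / (x1 ** 2).sum()
# Cross-fold estimator: E[x1 * x2] = mu**2, so the bias term vanishes
beta_cf = (x1 * ybar).sum() / (x1 * x2).sum()
```

With only 4 observations per fold but 2000 environments, `beta_naive` is visibly attenuated toward zero while `beta_cf` recovers the true effect of 2.0, mirroring the consistency regime described above.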
https://arxiv.org/abs/2601.15254
Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive gradient descent to discover such systematic reasoning remains poorly understood. We address this by analyzing the gradient flow dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought (CoT) but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, gradient flow drives the model to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler instances, the model learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, gradient-based learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.
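The exact task specification is not given in the abstract; a minimal stand-in for the "simple iterative solution" is successor-following on a chain graph, where the emitted intermediate vertices play the role of chain-of-thought tokens and "simple examples" correspond to instances reachable in fewer hops. The function and its signature are illustrative assumptions.

```python
def traverse_with_cot(edges, start, goal, max_steps=50):
    """Follow successor edges from start to goal, emitting every
    intermediate vertex as an explicit 'chain-of-thought' token.

    `edges` maps each vertex to its unique successor (a toy stand-in
    for a synthetic traversal task). Producing the final answer
    directly would require composing many hops in a single step,
    which is what makes chain-of-thought necessary here.
    """
    cot, current = [start], start
    for _ in range(max_steps):
        if current == goal:
            return cot
        current = edges[current]
        cot.append(current)
    raise ValueError("goal not reachable within max_steps")
```

On the chain `{0: 3, 3: 1, 1: 4, 4: 2}`, the query (start 0, goal 2) needs four reasoning steps, while (start 1, goal 2) is a "simple example" needing only two; the distributional claim above is about how much training mass falls on the latter kind.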
https://arxiv.org/abs/2601.15158
Dexterous grasping in cluttered environments presents substantial challenges due to the high degrees of freedom of dexterous hands, occlusion, and potential collisions arising from diverse object geometries and complex layouts. To address these challenges, we propose CADGrasp, a two-stage algorithm for general dexterous grasping using single-view point cloud inputs. In the first stage, we predict sparse IBS, a scene-decoupled, contact- and collision-aware representation, as the optimization target. Sparse IBS compactly encodes the geometric and contact relationships between the dexterous hand and the scene, enabling stable and collision-free dexterous grasp pose optimization. To enhance the prediction of this high-dimensional representation, we introduce an occupancy-diffusion model with voxel-level conditional guidance and force closure score filtering. In the second stage, we develop several energy functions and ranking strategies for optimization based on sparse IBS to generate high-quality dexterous grasp poses. Extensive experiments in both simulated and real-world settings validate the effectiveness of our approach, demonstrating its capability to mitigate collisions while maintaining a high grasp success rate across diverse objects and complex scenes.
https://arxiv.org/abs/2601.15039
Mixture-of-Experts (MoE) architectures enable conditional computation by routing inputs to multiple expert subnetworks and are often motivated as a mechanism for scaling large language models. In this project, we instead study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization. We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity. Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization, avoiding expert collapse. To analyze generalization, we compute Hessian-based sharpness metrics at convergence, including the largest eigenvalue and trace of the loss Hessian, evaluated on both training and test data. We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance. Complementary loss surface perturbation analyses reveal qualitative differences in non-local behavior under finite parameter perturbations between dense and MoE models, which help contextualize curvature-based measurements without directly explaining validation accuracy. We further evaluate empirical inference efficiency and show that naively implemented conditional routing does not yield inference speedups on modern hardware at this scale, highlighting the gap between theoretical and realized efficiency in sparse MoE models.
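The largest Hessian eigenvalue used as a sharpness metric is typically computed by power iteration on Hessian-vector products, never by materializing the Hessian. The sketch below illustrates this on a toy quadratic with finite-difference HVPs; real implementations would use autograd HVPs on the actual training loss.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-5):
    # Finite-difference Hessian-vector product:
    # H v ~= (g(w + eps v) - g(w - eps v)) / (2 eps)
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def top_hessian_eigenvalue(grad_fn, w, iters=100, seed=0):
    # Power iteration on the Hessian using only gradient evaluations;
    # lam converges to the largest eigenvalue (the lambda_max sharpness
    # metric), and v to the corresponding eigenvector.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        lam = float(v @ hv)          # Rayleigh quotient with unit v
        v = hv / np.linalg.norm(hv)
    return lam

# Toy quadratic loss 0.5 * w^T A w, whose Hessian is exactly A
A = np.diag([4.0, 1.0, 0.5])
grad = lambda w: A @ w
lam_max = top_hessian_eigenvalue(grad, np.zeros(3))
```

Here the recovered `lam_max` matches the known top eigenvalue 4.0; the Hessian trace reported alongside it is usually estimated with Hutchinson-style randomized probes over the same HVP primitive.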
https://arxiv.org/abs/2601.15021
Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named \textbf{L}ocal \textbf{D}iffusion \textbf{F}orcing for \textbf{V}ideo \textbf{F}rame \textbf{I}nterpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. To mitigate error accumulation inherent in auto-regressive generation, we introduce a novel skip-concatenate sampling strategy that effectively maintains temporal stability. Furthermore, LDF-VFI incorporates sparse, local attention and tiled VAE encoding, a combination that not only enables efficient processing of long sequences but also allows generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining. An enhanced conditional VAE decoder, which leverages multi-scale features from the input video, further improves reconstruction fidelity. Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency, especially in scenes with large motion. The source code is available at this https URL.
https://arxiv.org/abs/2601.14959
User interactions on e-commerce platforms are inherently diverse, involving behaviors such as clicking, favoriting, adding to cart, and purchasing. The transitions between these behaviors offer valuable insights into user-item interactions, serving as a key signal for understanding evolving preferences. Consequently, there is growing interest in leveraging multi-behavior data to better capture user intent. Recent studies have explored sequential modeling of multi-behavior data, many relying on transformer-based architectures with polynomial time complexity. While effective, these approaches often incur high computational costs, limiting their applicability in large-scale industrial systems with long user sequences. To address this challenge, we propose the Transition-Aware Graph Attention Network (TGA), a linear-complexity approach for modeling multi-behavior transitions. Unlike traditional transformers that treat all behavior pairs equally, TGA constructs a structured sparse graph by identifying informative transitions from three perspectives: (a) item-level transitions, (b) category-level transitions, and (c) neighbor-level transitions. Built upon the structured graph, TGA employs a transition-aware graph attention mechanism that jointly models user-item interactions and behavior transition types, enabling more accurate capture of sequential patterns while maintaining computational efficiency. Experiments show that TGA outperforms all state-of-the-art models while significantly reducing computational cost. Notably, TGA has been deployed in a large-scale industrial production environment, where it leads to impressive improvements in key business metrics.
https://arxiv.org/abs/2601.14955
While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a "sparse retrieval" assumption: that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this critical gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast, unstructured text without relying on fallible human annotation. We further demonstrate the utility of our framework beyond evaluation, showing that fine-tuning on our synthesized data effectively enhances an LLM's general long-context reasoning capabilities. Extensive experiments reveal that even state-of-the-art long-context LLMs struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely. Our findings indicate that memory-augmented agentic architectures offer a more robust alternative, suggesting a critical shift is needed from simply extending context windows to developing advanced architectures for global information synthesis.
https://arxiv.org/abs/2601.14952
Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and exhibit suboptimal performance in lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using the pipeline, we construct the first Chinese television dubbing dataset with rich annotations, and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following. Code and demos are available at this https URL.
https://arxiv.org/abs/2601.14777
Infrared small target detection (ISTD) under complex backgrounds remains a critical yet challenging task, primarily due to the extremely low signal-to-clutter ratio, persistent dynamic interference, and the lack of distinct target features. While multi-frame detection methods leverage temporal cues to improve upon single-frame approaches, existing methods still struggle with inefficient long-range dependency modeling and insufficient robustness. To overcome these issues, we propose a novel scheme for ISTD, realized through a sparse frames-based spatio-temporal semantic feedback network named FeedbackSTS-Det. The core of our approach is a novel spatio-temporal semantic feedback strategy with a closed-loop semantic association mechanism, which consists of paired forward and backward refinement modules that work cooperatively across the encoder and decoder. Moreover, both modules incorporate an embedded sparse semantic module (SSM), which performs structured sparse temporal modeling to capture long-range dependencies with low computational cost. This integrated design facilitates robust implicit inter-frame registration and continuous semantic refinement, effectively suppressing false alarms. Furthermore, our overall procedure maintains a consistent training-inference pipeline, which ensures reliable performance transfer and increases model robustness. Extensive experiments on multiple benchmark datasets confirm the effectiveness of FeedbackSTS-Det. Code and models are available at: this https URL.
https://arxiv.org/abs/2601.14690
Learning Path Recommendation (LPR) aims to generate personalized sequences of learning items that maximize long-term learning effect while respecting pedagogical principles and operational constraints. Although large language models (LLMs) offer rich semantic understanding for free-form recommendation, applying them to long-horizon LPR is challenging due to (i) misalignment with pedagogical objectives such as the Zone of Proximal Development (ZPD) under sparse, delayed feedback, (ii) scarce and costly expert demonstrations, and (iii) multi-objective interactions among learning effect, difficulty scheduling, length controllability, and trajectory diversity. To address these issues, we propose IB-GRPO (Indicator-Based Group Relative Policy Optimization), an indicator-guided alignment approach for LLM-based LPR. To mitigate data scarcity, we construct hybrid expert demonstrations via Genetic Algorithm search and teacher RL agents and warm-start the LLM with supervised fine-tuning. Building on this warm-start, we design a within-session ZPD alignment score for difficulty scheduling. IB-GRPO then uses the $I_{\epsilon+}$ dominance indicator to compute group-relative advantages over multiple objectives, avoiding manual scalarization and improving Pareto trade-offs. Experiments on ASSIST09 and Junyi using the KES simulator with a Qwen2.5-7B backbone show consistent improvements over representative RL and LLM baselines.
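For minimization objectives, the additive epsilon indicator $I_{\epsilon+}$ has a one-line form. The sketch below pairs it with a hypothetical IBEA-style aggregation into group-relative advantages; the abstract does not state the exact aggregation IB-GRPO uses, so both the aggregation and the GRPO-style normalization here are assumptions.

```python
import numpy as np

def i_epsilon_plus(a, b):
    """Additive epsilon indicator I_eps+(a, b) for minimization:
    the smallest shift eps such that (a - eps) weakly dominates b."""
    return float(np.max(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def group_relative_advantage(group):
    """Score each candidate objective vector by how far every rival
    would have to shift to dominate it (an IBEA-style aggregation,
    assumed here for illustration), then normalize within the group
    as GRPO-style advantages. Higher is better."""
    group = [np.asarray(g, dtype=float) for g in group]
    scores = np.array([
        sum(i_epsilon_plus(b, a) for j, b in enumerate(group) if j != i)
        for i, a in enumerate(group)
    ])
    return (scores - scores.mean()) / (scores.std() + 1e-8)
```

On a group of three candidates where `[0, 0]` Pareto-dominates `[1, 1]`, which in turn dominates `[2, 2]`, the resulting advantages are strictly ordered without any manual scalarization of the objectives.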
https://arxiv.org/abs/2601.14686
Recent advances in embodied intelligence have leveraged massive scaling of data and model parameters to master natural-language command following and multi-task control. In contrast, biological systems demonstrate an innate ability to acquire skills rapidly from sparse experience. Crucially, current robotic policies struggle to replicate the dynamic stability, reflexive responsiveness, and temporal memory inherent in biological motion. Here we present Neuromorphic Vision-Language-Action (NeuroVLA), a framework that mimics the structural organization of the bio-nervous system between the cortex, cerebellum, and spinal cord. We adopt a system-level bio-inspired design: a high-level model plans goals, an adaptive cerebellum module stabilizes motion using high-frequency sensor feedback, and a bio-inspired spinal layer executes lightning-fast action generation. NeuroVLA represents the first deployment of a neuromorphic VLA on physical robotics, achieving state-of-the-art performance. We observe the emergence of biological motor characteristics without additional data or special guidance: it eliminates shaking in robotic arms, saves significant energy (only 0.4 W on the neuromorphic processor), exhibits temporal memory, and triggers safety reflexes in under 20 milliseconds.
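The cortex/cerebellum/spinal split described above can be sketched as a layered control loop in which the fast reflex path bypasses every slower layer. This is a hedged illustration of the organizational idea only: the gains, threshold, and PD-style correction below are our own placeholders, not NeuroVLA's actual controllers.

```python
def spinal_reflex(force_reading, threshold=5.0):
    """Fast safety reflex (illustrative): emit a halt signal when the
    contact-force reading spikes. Because this check runs on every control
    tick, its reaction latency is a single control period."""
    return "HALT" if abs(force_reading) > threshold else None

def cerebellum_stabilize(target, state, velocity, kp=0.8, kd=0.2):
    """High-frequency PD-style correction toward the slow planner's target
    (a stand-in for the adaptive cerebellum module)."""
    return kp * (target - state) - kd * velocity

def control_tick(target, state, velocity, force):
    """One control period: the reflex path is evaluated first and, when it
    fires, bypasses the stabilization layer entirely."""
    reflex = spinal_reflex(force)
    if reflex is not None:
        return 0.0, reflex            # zero command + safety flag
    return cerebellum_stabilize(target, state, velocity), None
```

The design point is ordering: the cheap reflex check is evaluated before any expensive computation, which is how a sub-20 ms safety response can coexist with a slow high-level planner.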
https://arxiv.org/abs/2601.14628
Speech Emotion Recognition models typically use single categorical labels, overlooking the inherent ambiguity of human emotions. Ambiguous Emotion Recognition addresses this by representing emotions as probability distributions, but progress is limited by unreliable ground-truth distributions inferred from sparse human annotations. This paper explores whether Large Audio-Language Models (ALMs) can mitigate the annotation bottleneck by generating high-quality synthetic annotations. We introduce a framework leveraging ALMs to create Synthetic Perceptual Proxies, augmenting human annotations to improve ground-truth distribution reliability. We validate these proxies through statistical analysis of their alignment with human distributions and evaluate their impact by fine-tuning ALMs with the augmented emotion distributions. Furthermore, to address class imbalance and enable unbiased evaluation, we propose DiME-Aug, a Distribution-aware Multimodal Emotion Augmentation strategy. Experiments on IEMOCAP and MSP-Podcast show that synthetic annotations improve emotion distribution reliability, especially in low-ambiguity regions where annotation agreement is high. However, benefits diminish for highly ambiguous emotions with greater human disagreement. This work provides the first evidence that ALMs could address annotation scarcity in ambiguous emotion recognition, but highlights the need for more advanced prompting or generation strategies to handle highly ambiguous cases.
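Augmenting human annotations with synthetic proxies amounts to blending two empirical label distributions into one soft ground truth. The sketch below is our own simplification of that idea: the label set, the `alpha` mixing weight, and the function names are illustrative assumptions, not the paper's formulation.

```python
from collections import Counter

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # illustrative label set

def augmented_distribution(human_labels, synthetic_labels, alpha=0.5):
    """Blend the empirical distribution of human annotations with the
    distribution of ALM-generated proxy labels (assumed convex mixture;
    alpha weights the human side). Returns a probability vector over
    EMOTIONS."""
    def empirical(labels):
        c = Counter(labels)
        n = len(labels) or 1  # avoid division by zero on an empty list
        return [c[e] / n for e in EMOTIONS]
    h, s = empirical(human_labels), empirical(synthetic_labels)
    return [alpha * hi + (1 - alpha) * si for hi, si in zip(h, s)]
```

With, say, three human votes `["happy", "happy", "neutral"]` and two synthetic proxies `["happy", "sad"]`, the blended vector stays a valid distribution while smoothing the sparse human counts — the reliability gain the abstract reports in low-ambiguity regions.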
https://arxiv.org/abs/2601.14620
A paradigm shift from sparse object detection towards dense 3D semantic occupancy prediction is necessary to address long-tail safety challenges for autonomous vehicles. Nonetheless, current voxelization methods commonly suffer from excessive computational complexity, and their fusion processes are brittle, static, and break down in dynamic environments. To this end, this work proposes a novel Gaussian-based adaptive camera-LiDAR multimodal 3D occupancy prediction model that seamlessly bridges the semantic strengths of the camera modality with the geometric strengths of the LiDAR modality through a memory-efficient 3D Gaussian model. The proposed solution has four key components: (1) LiDAR Depth Feature Aggregation (LDFA), where depth-wise deformable sampling is employed to deal with geometric sparsity, (2) Entropy-Based Feature Smoothing, where cross-entropy is employed to handle domain-specific noise, (3) Adaptive Camera-LiDAR Fusion, where sensor outputs are dynamically recalibrated based on model outputs, and (4) a Gauss-Mamba Head that uses Selective State Space Models for global context decoding with linear computational complexity.
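The adaptive fusion component dynamically recalibrates the contribution of each sensor rather than using a fixed blend. The sketch below shows one plausible form of such confidence-gated fusion; the softmax gating over scalar reliability logits and all names are our assumptions, not the paper's design.

```python
import numpy as np

def adaptive_fusion(cam_feat, lidar_feat, cam_logit, lidar_logit):
    """Hedged sketch of adaptive camera-LiDAR fusion: per-location weights
    come from a softmax over scalar reliability logits, so a degraded
    sensor is downweighted on the fly.

    cam_feat, lidar_feat: (N, C) features; cam_logit, lidar_logit: (N,)
    reliability logits (assumed to come from the model's own outputs)."""
    logits = np.stack([cam_logit, lidar_logit], axis=-1)   # (N, 2)
    # Numerically stable softmax over the two modalities.
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w[:, :1] * cam_feat + w[:, 1:] * lidar_feat
```

Equal logits recover a 50/50 blend, while a strongly camera-favoring logit drives the fused feature toward the camera branch — the "dynamic recalibration" behavior, as opposed to a static fusion that breaks down when one sensor degrades.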
https://arxiv.org/abs/2601.14448