To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large fraction of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to a 3.67$\times$ speedup over dense models on real end-side devices. All code and checkpoints are publicly available (this https URL).
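As a concrete illustration of the routing idea, here is a minimal PyTorch sketch of a ReLU-plus-RMSNorm router, assuming the normalization is applied to the non-negative routing scores (layer sizes, the omission of a learnable gain, and the exact placement of RMSNorm are our assumptions, not details confirmed by the abstract):

    import torch
    import torch.nn as nn

    class ReLURMSNormRouter(nn.Module):
        """Hedged sketch: linear scores -> ReLU (naturally sparse, differentiable)
        -> RMSNorm to keep the activation scale stable across tokens."""
        def __init__(self, hidden_size: int, num_experts: int, eps: float = 1e-6):
            super().__init__()
            self.proj = nn.Linear(hidden_size, num_experts, bias=False)
            self.eps = eps

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            scores = torch.relu(self.proj(x))  # exact zeros give token-level sparsity
            rms = scores.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
            return scores * rms  # RMS-normalized expert weights

Because ReLU outputs exact zeros, such a router yields sparse yet fully differentiable expert weights without a hard top-k selection.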
https://arxiv.org/abs/2507.08771
Learning robot manipulation policies from raw, real-world image data requires a large number of robot-action trials in the physical environment. Although training in simulation offers a cost-effective alternative, the visual domain gap between simulation and the robot workspace remains a major limitation. Gaussian Splatting visual reconstruction methods have recently provided new directions for robot manipulation by generating realistic environments. In this paper, we propose the first method for learning supervised robot handover policies solely from RGB images, without the need for real-robot training or real-robot data collection. The proposed policy learner, Human-to-Robot Handover using Sparse-View Gaussian Splatting (H2RH-SGS), leverages sparse-view Gaussian Splatting reconstructions of human-to-robot handover scenes to generate robot demonstrations containing image-action pairs captured with a camera mounted on the robot gripper. As a result, simulated camera pose changes in the reconstructed scene can be directly translated into gripper pose changes. We train a robot policy on demonstrations collected with 16 household objects and {\em directly} deploy this policy in the real environment. Experiments in both Gaussian Splatting reconstructed scenes and real-world human-to-robot handovers demonstrate that H2RH-SGS serves as a new and effective representation for the human-to-robot handover task.
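Since the camera is rigidly mounted on the gripper, the pose-transfer step the abstract describes reduces to conjugating the relative camera motion by the fixed hand-eye transform. A small numpy sketch under that assumption (all names are illustrative, not the authors' API):

    import numpy as np

    def camera_delta_to_gripper_delta(T_cam_prev, T_cam_next, T_grip_cam):
        """T_cam_* are 4x4 camera poses in the world frame; T_grip_cam is the
        fixed gripper-to-camera extrinsic. Returns the gripper pose change."""
        delta_cam = np.linalg.inv(T_cam_prev) @ T_cam_next          # relative camera motion
        return T_grip_cam @ delta_cam @ np.linalg.inv(T_grip_cam)   # same motion in the gripper frame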
https://arxiv.org/abs/2507.08726
Multiple Instance Learning (MIL) offers a natural solution for settings where only coarse, bag-level labels are available, without access to instance-level annotations. This is usually the case in digital pathology, which deals with gigapixel-sized images. While deterministic attention-based MIL approaches achieve strong bag-level performance, they often overlook the uncertainty inherent in instance relevance. In this paper, we address the lack of uncertainty quantification in instance-level attention scores by introducing \textbf{SGPMIL}, a new probabilistic attention-based MIL framework grounded in Sparse Gaussian Processes (SGP). By learning a posterior distribution over attention scores, SGPMIL enables principled uncertainty estimation, resulting in more reliable and calibrated instance relevance maps. Our approach not only preserves competitive bag-level performance but also significantly improves the quality and interpretability of instance-level predictions under uncertainty. SGPMIL extends prior work by introducing feature scaling in the SGP predictive mean function, leading to faster training, improved efficiency, and enhanced instance-level performance. Extensive experiments on multiple well-established digital pathology datasets highlight the effectiveness of our approach across both bag- and instance-level evaluations. Our code will be made publicly available.
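To make the uncertainty idea concrete, here is a simplified Monte Carlo view of probabilistic attention (a toy stand-in for the paper's sparse-GP posterior, not the SGPMIL implementation):

    import torch

    def mc_attention_pooling(mean, var, feats, n_samples=16):
        """mean/var: Gaussian posterior over per-instance attention logits [N];
        feats: instance features [N, D]. Returns the mean pooled bag feature
        and a per-instance spread that serves as an uncertainty proxy."""
        eps = torch.randn(n_samples, *mean.shape)
        logits = mean + var.sqrt() * eps               # [S, N] sampled logits
        attn = torch.softmax(logits, dim=-1)           # attention per sample
        bags = torch.einsum('sn,nd->sd', attn, feats)  # [S, D] pooled bags
        return bags.mean(0), attn.std(0)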
https://arxiv.org/abs/2507.08711
Despite advances from medical large language models in healthcare, rare-disease diagnosis remains hampered by insufficient knowledge-representation depth, limited concept understanding, and constrained clinical reasoning. We propose a framework that couples multi-granularity sparse activation of medical concepts with a hierarchical knowledge graph. Four complementary matching algorithms, diversity control, and a five-level fallback strategy enable precise concept activation, while a three-layer knowledge graph (taxonomy, clinical features, instances) provides structured, up-to-date context. Experiments on the BioASQ rare-disease QA set show BLEU gains of 0.09, ROUGE gains of 0.05, and accuracy gains of 0.12, with peak accuracy of 0.89 approaching the 0.90 clinical threshold. Expert evaluation confirms improvements in information quality, reasoning, and professional expression, suggesting our approach shortens the "diagnostic odyssey" for rare-disease patients.
https://arxiv.org/abs/2507.08529
Reliable satellite attitude control is essential for the success of space missions, particularly as satellites increasingly operate autonomously in dynamic and uncertain environments. Reaction wheels (RWs) play a pivotal role in attitude control, and maintaining control resilience during RW faults is critical to preserving mission objectives and system stability. However, traditional Proportional-Derivative (PD) controllers and existing deep reinforcement learning (DRL) algorithms such as TD3, PPO, and A2C often fall short in providing the real-time adaptability and fault tolerance required for autonomous satellite operations. This study introduces a DRL-based control strategy designed to improve satellite resilience and adaptability under fault conditions. Specifically, the proposed method integrates Twin Delayed Deep Deterministic Policy Gradient (TD3) with Hindsight Experience Replay (HER) and Dimension-Wise Clipping (DWC), a combination referred to as TD3-HD, to enhance learning in sparse-reward environments and maintain satellite stability during RW failures. The proposed approach is benchmarked against PD control and leading DRL algorithms. Experimental results show that TD3-HD achieves significantly lower attitude error, improved angular velocity regulation, and enhanced stability under fault conditions. These findings underscore the proposed method's potential as a powerful, fault-tolerant, onboard AI solution for autonomous satellite attitude control.
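For readers unfamiliar with HER, the relabeling step at the heart of TD3-HD's sparse-reward handling looks roughly like this (a generic HER sketch adapted to attitude goals; the tuple layout and reward function are assumptions):

    def her_relabel(transition, achieved_attitude, reward_fn):
        """Replace the desired attitude goal with the attitude actually
        reached, recompute the (otherwise sparse) reward, and store the
        relabeled transition alongside the original one."""
        state, action, _, next_state, _ = transition
        new_reward = reward_fn(next_state, achieved_attitude)
        return (state, action, new_reward, next_state, achieved_attitude)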
https://arxiv.org/abs/2507.08366
Deep learning has shown remarkable performance in integrating multimodal data for survival prediction. However, existing multimodal methods mainly focus on single cancer types and overlook the challenge of generalization across cancers. In this work, we are the first to reveal that multimodal prognosis models often generalize worse than unimodal ones in cross-cancer scenarios, despite the critical need for such robustness in clinical practice. To address this, we propose a new task: Cross-Cancer Single Domain Generalization for Multimodal Prognosis, which evaluates whether models trained on a single cancer type can generalize to unseen cancers. We identify two key challenges: degraded features from weaker modalities and ineffective multimodal integration. To tackle these, we introduce two plug-and-play modules: Sparse Dirac Information Rebalancer (SDIR) and Cancer-aware Distribution Entanglement (CADE). SDIR mitigates the dominance of strong features by applying Bernoulli-based sparsification and Dirac-inspired stabilization to enhance weaker modality signals. CADE, designed to synthesize the target domain distribution, fuses local morphological cues and global gene expression in latent space. Experiments on a four-cancer-type benchmark demonstrate superior generalization, laying the foundation for practical, robust cross-cancer multimodal prognosis. Code is available at this https URL
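The Bernoulli-based sparsification in SDIR can be pictured as a dropout-like mask on the dominant modality; a simplified sketch of our reading (the actual SDIR also adds Dirac-inspired stabilization, omitted here):

    import torch

    def bernoulli_sparsify(strong_feat, p_drop=0.5):
        """Randomly zero entries of the stronger modality's features so the
        weaker modality carries more weight during fusion; inverted-dropout
        rescaling keeps the expected magnitude unchanged."""
        mask = torch.bernoulli(torch.full_like(strong_feat, 1.0 - p_drop))
        return strong_feat * mask / (1.0 - p_drop)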
https://arxiv.org/abs/2507.08340
Traffic accidents are rare yet high-impact events that require long-context multimodal reasoning for accurate risk forecasting. In this paper, we introduce ALCo-FM, a unified adaptive long-context foundation model that computes a volatility pre-score to dynamically select context windows for input data, then encodes and fuses these multimodal data via shallow cross-attention. A local GAT layer followed by a BigBird-style sparse global transformer over H3 hexagonal grids, coupled with Monte Carlo dropout for confidence estimation, yields superior, well-calibrated predictions. Trained on data from 15 US cities with a class-weighted loss to counter label imbalance, and fine-tuned with minimal data on held-out cities, ALCo-FM achieves 0.94 accuracy, 0.92 F1, and an ECE of 0.04, outperforming more than 20 state-of-the-art baselines in large-scale urban risk prediction. Code and dataset are available at: this https URL
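The volatility pre-score mechanism amounts to a gating rule over window lengths; a minimal sketch (thresholds and window sizes here are placeholders, not the paper's values):

    def select_context_window(volatility, thresholds=(0.3, 0.6), windows=(6, 12, 24)):
        """Map a scalar volatility pre-score to a short, medium, or long
        context window for the downstream encoder."""
        for t, w in zip(thresholds, windows):
            if volatility < t:
                return w
        return windows[-1]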
https://arxiv.org/abs/2507.08153
3D Gaussian Splatting (3DGS) has demonstrated its potential in reconstructing scenes from unposed images. However, optimization-based 3DGS methods struggle with sparse views due to limited prior knowledge. Meanwhile, feed-forward Gaussian approaches are constrained by input formats, making it challenging to incorporate more input views. To address these challenges, we propose RegGS, a 3D Gaussian registration-based framework for reconstructing unposed sparse views. RegGS aligns local 3D Gaussians generated by a feed-forward network into a globally consistent 3D Gaussian representation. Technically, we implement an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein $(\text{MW}_2)$ distance, which serves as an alignment metric for Gaussian mixture models (GMMs) in $\mathrm{Sim}(3)$ space. Furthermore, we design a joint 3DGS registration module that integrates the $\text{MW}_2$ distance, photometric consistency, and depth geometry. This enables a coarse-to-fine registration process while accurately estimating camera poses and aligning the scene. Experiments on the RE10K and ACID datasets demonstrate that RegGS effectively registers local Gaussians with high fidelity, achieving precise pose estimation and high-quality novel-view synthesis. Project page: this https URL.
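The entropy-regularized Sinkhorn solver at the core of the $\text{MW}_2$ alignment is standard; a compact PyTorch version (here `cost` would hold pairwise 2-Wasserstein distances between Gaussian components, and `a`, `b` the mixture weights):

    import torch

    def sinkhorn(cost, a, b, eps=0.05, iters=100):
        """Entropy-regularized optimal transport: returns the transport plan
        coupling two discrete distributions a [N] and b [M] under cost [N, M]."""
        K = torch.exp(-cost / eps)
        u = torch.ones_like(a)
        for _ in range(iters):
            v = b / (K.t() @ u)
            u = a / (K @ v)
        return u[:, None] * K * v[None, :]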
https://arxiv.org/abs/2507.08136
The advent of language models (LMs) has the potential to dramatically accelerate tasks that may be cast as text-processing; however, real-world adoption is hindered by concerns regarding safety, explainability, and bias. How can we responsibly leverage LMs in a transparent, auditable manner -- minimizing risk and allowing human experts to focus on informed decision-making rather than data-processing or prompt engineering? In this work, we propose a framework for declaring statically typed, LM-powered subroutines (i.e., callable, function-like procedures) for use within conventional asynchronous code -- such that sparse feedback from human experts is used to improve the performance of each subroutine online (i.e., during use). In our implementation, all LM-produced artifacts (i.e., prompts, inputs, outputs, and data-dependencies) are recorded and exposed to audit on demand. We package this framework as a library to support its adoption and continued development. While this framework may be applicable across several real-world decision workflows (e.g., in healthcare and legal fields), we evaluate it in the context of public comment processing as mandated by the 1969 National Environmental Policy Act (NEPA): specifically, we use this framework to develop "CommentNEPA," an application that compiles, organizes, and summarizes a corpus of public commentary submitted in response to a project requiring environmental review. We quantitatively evaluate the application by comparing its outputs (when operating without human feedback) to historical ``ground-truth'' data as labelled by human annotators during the preparation of official environmental impact statements.
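A toy sketch of what a statically typed, auditable LM subroutine might look like (names and structure are ours; the paper's library will differ):

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class LMSubroutine:
        """A typed, callable wrapper around an LM prompt that records every
        artifact -- prompt, input, output -- for on-demand audit."""
        prompt_template: str
        call_lm: Callable[[str], str]            # injected LM client (assumption)
        audit_log: List[dict] = field(default_factory=list)

        def __call__(self, text: str) -> str:
            prompt = self.prompt_template.format(input=text)
            output = self.call_lm(prompt)
            self.audit_log.append({"prompt": prompt, "input": text, "output": output})
            return output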
https://arxiv.org/abs/2507.08109
We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
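The "unbiased n-step backup" follows from the fact that the critic is queried only at chunk boundaries, so the intermediate rewards are summed on-policy with no off-policy correction; a sketch of the chunk-level TD target:

    def chunked_td_target(rewards, gamma, q_next):
        """rewards: the n per-step rewards collected while executing one action
        chunk; q_next: the critic's value at the next chunk boundary."""
        discount, ret = 1.0, 0.0
        for r in rewards:
            ret += discount * r
            discount *= gamma
        return ret + discount * q_next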
https://arxiv.org/abs/2507.07969
Recent studies show that large language models (LLMs) and vision-language models (VLMs) trained on web-scale data can empower end-to-end autonomous driving systems with better generalization and interpretation. Specifically, by dynamically routing inputs to specialized subsets of parameters, the Mixture-of-Experts (MoE) technique enables general LLMs or VLMs to achieve substantial performance improvements while maintaining computational efficiency. However, general MoE models usually demand extensive training data and complex optimization. In this work, inspired by the learning process of human drivers, we propose a skill-oriented MoE, called MoSE, which mimics human drivers' learning and reasoning process, skill-by-skill and step-by-step. We propose a skill-oriented routing mechanism that begins with defining and annotating specific skills, enabling experts to identify the necessary driving competencies for various scenarios and reasoning tasks, thereby facilitating skill-by-skill learning. To further align the driving process with the multi-step planning of human reasoning and end-to-end driving models, we build a hierarchical skill dataset and pretrain the router to encourage the model to think step-by-step. Unlike multi-round dialogs, MoSE integrates valuable auxiliary tasks (e.g.\ description, reasoning, planning) in one single forward process without introducing any extra computational cost. With fewer than 3B sparsely activated parameters, our model outperforms several models with 8B+ parameters on the CODA AD corner-case reasoning task. Compared to existing methods based on open-source models and data, our approach achieves state-of-the-art performance with a significantly reduced activated model size (by at least $62.5\%$) in a single-turn conversation.
https://arxiv.org/abs/2507.07818
This paper establishes the first comprehensive review of Large Language Models (LLMs) applied within the legal domain. It pioneers an innovative dual-lens taxonomy that integrates legal reasoning frameworks and professional ontologies to systematically unify historical research and contemporary breakthroughs. Transformer-based LLMs, which exhibit emergent capabilities such as contextual reasoning and generative argumentation, surmount traditional limitations by dynamically capturing legal semantics and unifying evidence reasoning. Significant progress is documented in task generalization, reasoning formalization, workflow integration, and addressing core challenges in text processing, knowledge integration, and evaluation rigor via technical innovations like sparse attention mechanisms and mixture-of-experts architectures. However, widespread adoption of LLMs introduces critical challenges: hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry. This review proposes a novel taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, thus systematizing advances in reasoning, retrieval, prediction, and dispute resolution. It identifies key frontiers including low-resource systems, multimodal evidence integration, and dynamic rebuttal handling. Ultimately, this work provides both a technical roadmap for researchers and a conceptual framework for practitioners navigating the algorithmic future, laying a robust foundation for the next era of legal artificial intelligence. We have created a GitHub repository to index the relevant papers: this https URL.
https://arxiv.org/abs/2507.07748
Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning -- and particularly side-tuning (ST) -- has emerged as an effective alternative. However, prior ST methods approach this problem from a frame-level refinement perspective, overlooking the inherently sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of deformable attention -- a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of the InternVideo2 backbone into an ST framework, showing its profound implications for performance. Overall, our method significantly improves on existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% w.r.t. existing SOTA methods. The code is publicly accessible at this https URL.
https://arxiv.org/abs/2507.07744
Fine-tuning is an immensely resource-intensive process when retraining Large Language Models (LLMs) to incorporate a larger body of knowledge. Although many fine-tuning techniques have been developed to reduce the time and computational cost involved, the challenge persists as LLMs continue to grow in size and complexity. To address this, a new approach to knowledge expansion in LLMs is needed. Retrieval-Augmented Generation (RAG) offers one such alternative by storing external knowledge in a database and retrieving relevant chunks to support question answering. However, naive implementations of RAG face significant limitations in scalability and answer accuracy. This paper introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome these limitations. Inspired by the divide-and-conquer paradigm, K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. The framework also includes a preprocessing step that summarizes the training data, significantly reducing the training time. K2RAG was evaluated using the MultiHopRAG dataset, where the proposed pipeline was trained on the document corpus and tested on a separate evaluation set. Results demonstrated notable improvements over common naive RAG implementations: K2RAG achieved the highest mean answer similarity score of 0.57 and the highest third-quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers. In addition to improved accuracy, the framework proved highly efficient. The summarization step reduced the average training time of individual components by 93%, and execution speed was up to 40% faster than traditional knowledge-graph-based RAG systems. K2RAG also demonstrated superior scalability, requiring a third of the VRAM needed by several naive RAG implementations tested in this study.
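The dense-plus-sparse retrieval step can be pictured as score fusion over the union of both candidate sets; a hedged sketch (the index APIs and the min-max blend are placeholders for whatever K2RAG actually uses):

    def hybrid_retrieve(query, dense_index, sparse_index, k=10, alpha=0.5):
        """Blend min-max-normalized dense and sparse scores and return the
        top-k document ids; each index.search returns {doc_id: score}."""
        def norm(scores):
            lo, hi = min(scores.values()), max(scores.values())
            return {d: (s - lo) / (hi - lo + 1e-9) for d, s in scores.items()}
        d = norm(dense_index.search(query, k))
        s = norm(sparse_index.search(query, k))
        fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
                 for doc in set(d) | set(s)}
        return sorted(fused, key=fused.get, reverse=True)[:k]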
https://arxiv.org/abs/2507.07695
Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding, aiming to achieve semantically accurate reconstructions in Ultra-Low Bitrate (ULB) scenarios by leveraging strong generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or an excessive dependence on high-level text guidance, which often fails to capture motion details and results in unrealistic reconstructions. To address these challenges, we propose a Trajectory-Guided Generative Video Coding framework (dubbed T-GVC). T-GVC employs a semantic-aware sparse motion sampling pipeline to effectively bridge low-level motion tracking with high-level semantic understanding by extracting pixel-wise motion as sparse trajectory points based on their semantic importance, not only significantly reducing the bitrate but also preserving critical temporal semantic information. In addition, by incorporating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free latent space guidance mechanism to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that our framework outperforms both traditional codecs and state-of-the-art end-to-end video compression methods under ULB conditions. Furthermore, additional experiments confirm that our approach achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.
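Semantic-aware sparse motion sampling can be illustrated as a top-k selection over a per-pixel importance map; a numpy sketch of our reading (the paper's sampling pipeline is more elaborate):

    import numpy as np

    def sample_sparse_trajectories(flow, importance, k):
        """flow: [H, W, 2] per-pixel motion; importance: [H, W] semantic score.
        Keeps the k most important pixels as sparse trajectory points
        (x, y, dx, dy) to be transmitted instead of dense motion."""
        idx = np.argsort(importance.ravel())[-k:]
        ys, xs = np.unravel_index(idx, importance.shape)
        return np.stack([xs, ys, flow[ys, xs, 0], flow[ys, xs, 1]], axis=1)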
https://arxiv.org/abs/2507.07633
We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet, resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations. We then train a 3D network based on these refined labels. This simple method, called LOSC, outperforms the SOTA of zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, with significant margins.
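The classic back-projection step that LOSC starts from (before its consolidation stage) looks like this in numpy; the extrinsic/intrinsic names are generic:

    import numpy as np

    def backproject_labels(points, T_cam_lidar, K, label_map):
        """points: [N, 3] lidar points; T_cam_lidar: 4x4 extrinsic; K: 3x3
        intrinsics; label_map: [H, W] per-pixel semantic labels from a VLM.
        Returns a per-point label, -1 where the point has no valid pixel."""
        p_cam = (T_cam_lidar @ np.c_[points, np.ones(len(points))].T).T[:, :3]
        uv = (K @ p_cam.T).T
        uv = uv[:, :2] / uv[:, 2:3]
        h, w = label_map.shape
        valid = (p_cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        labels = np.full(len(points), -1, dtype=int)
        labels[valid] = label_map[uv[valid, 1].astype(int), uv[valid, 0].astype(int)]
        return labels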
https://arxiv.org/abs/2507.07605
What algorithms do LLMs actually learn and use to solve problems? Studies addressing this question are sparse, as research priorities are focused on improving performance through scale, leaving a theoretical and empirical gap in understanding emergent algorithms. This position paper proposes AlgEval: a framework for systematic research into the algorithms that LLMs learn and use. AlgEval aims to uncover algorithmic primitives, reflected in latent representations, attention, and inference-time compute, and their algorithmic composition to solve task-specific problems. We highlight potential methodological paths and a case study toward this goal, focusing on emergent search algorithms. Our case study illustrates both the formation of top-down hypotheses about candidate algorithms, and bottom-up tests of these hypotheses via circuit-level analysis of attention patterns and hidden states. The rigorous, systematic evaluation of how LLMs actually solve tasks provides an alternative to resource-intensive scaling, reorienting the field toward a principled understanding of underlying computations. Such algorithmic explanations offer a pathway to human-understandable interpretability, enabling comprehension of the model's internal reasoning performance measures. This can in turn lead to more sample-efficient methods for training and improving performance, as well as novel architectures for end-to-end and multi-agent systems.
https://arxiv.org/abs/2507.07544
Trajectory modeling of dense points usually employs implicit deformation fields, represented as neural networks that map coordinates in order to relate canonical spatial positions to temporal offsets. However, the inductive biases inherent in neural networks can hinder spatial coherence in ill-posed scenarios. Current methods either focus on enhancing encoding strategies for deformation fields, often resulting in opaque and less intuitive models, or adopt explicit techniques like linear blend skinning, which rely on heuristic-based node initialization. Additionally, the potential of implicit representations for interpolating sparse temporal signals remains under-explored. To address these challenges, we propose a spline-based trajectory representation, where the number of knots explicitly determines the degrees of freedom. This approach enables efficient analytical derivation of velocities and accelerations, preserving spatial coherence while mitigating temporal fluctuations. To model knot characteristics in both spatial and temporal domains, we introduce a novel low-rank time-variant spatial encoding, replacing conventional coupled spatiotemporal techniques. Our method demonstrates superior performance in temporal interpolation for fitting continuous fields with sparse inputs. Furthermore, it achieves dynamic scene reconstruction quality competitive with state-of-the-art methods while enhancing motion coherence without relying on linear blend skinning or as-rigid-as-possible constraints.
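The property the abstract leans on -- analytic velocities and accelerations from a knot-controlled representation -- is easy to see with an off-the-shelf spline (a 1-D toy, not the paper's 3-D formulation):

    import numpy as np
    from scipy.interpolate import CubicSpline

    knot_times = np.linspace(0.0, 1.0, 6)            # knot count fixes the degrees of freedom
    knot_values = np.sin(2 * np.pi * knot_times)     # stand-in trajectory data
    traj = CubicSpline(knot_times, knot_values)

    t = 0.37
    pos, vel, acc = traj(t), traj(t, 1), traj(t, 2)  # exact derivatives, no finite differences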
https://arxiv.org/abs/2507.07521
Modern deep learning implementations for medical imaging usually rely on large labeled datasets. These datasets are often difficult to obtain due to privacy concerns, high costs, and even scarcity of cases. In this paper, a label-efficient strategy is proposed for chest X-ray diagnosis that seeks to reflect real-world hospital scenarios. The experiments use the NIH Chest X-ray14 dataset and a pre-trained CLIP ViT-B/32 model. The model is adapted via partial fine-tuning of its visual encoder and then evaluated using zero-shot and few-shot learning with 1-16 labeled examples per disease class. The tests demonstrate that CLIP's pre-trained vision-language features can be effectively adapted to few-shot medical imaging tasks, achieving over 20\% improvement in mean AUC score compared to the zero-shot baseline. A key aspect of this work is its attempt to simulate internal hospital workflows, where image archives exist but annotations are sparse. This work evaluates a practical and scalable solution for both common and rare disease diagnosis. Additionally, this research is intended for academic and experimental purposes only and has not yet been peer reviewed. All code is found at this https URL.
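Partial fine-tuning of the visual encoder can be set up in a few lines with the OpenAI CLIP package; which blocks to unfreeze is our assumption, not the paper's exact recipe:

    import clip
    import torch

    model, preprocess = clip.load("ViT-B/32")
    for p in model.parameters():                       # freeze everything
        p.requires_grad = False
    for p in model.visual.transformer.resblocks[-1].parameters():
        p.requires_grad = True                         # unfreeze only the last visual block
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-5)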
https://arxiv.org/abs/2507.07254
Supply chain networks are complex systems that are challenging to analyze; this problem is exacerbated when there are illicit activities involved in the supply chain, such as counterfeit parts, forced labor, or human trafficking. While machine learning (ML) can find patterns in complex systems like supply chains, traditional ML techniques require large training data sets. However, illicit supply chains are characterized by very sparse data, and the data that is available is often (purposely) corrupted or unreliable in order to hide the nature of the activities. We need to be able to automatically detect new patterns that correlate with such illegal activity over complex, even temporal data, without requiring large training data sets. We explore neurosymbolic methods for identifying instances of illicit activity in supply chains and compare the effectiveness of manual and automated feature extraction from news articles accurately describing illicit activities uncovered by authorities. We propose a question tree approach for querying a large language model (LLM) to identify and quantify the relevance of articles. This enables a systematic evaluation of the differences between human and machine classification of news articles related to forced labor in supply chains.
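The question-tree idea can be sketched as a short cascade of gated LLM queries, where later questions are asked only if earlier ones pass (the questions and scoring here are illustrative, not the paper's instrument):

    def question_tree_relevance(article, ask_llm):
        """ask_llm is an injected yes/no LLM call. Returns a coarse relevance
        score for a news article with respect to forced labor in supply chains."""
        if ask_llm(f"Does this article concern a supply chain? {article}") != "yes":
            return 0.0
        if ask_llm(f"Does it describe forced labor uncovered by authorities? {article}") != "yes":
            return 0.25
        if ask_llm(f"Are specific companies or goods named? {article}") == "yes":
            return 1.0
        return 0.5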
https://arxiv.org/abs/2507.07217