Generative recommendation systems have achieved significant advances by leveraging semantic IDs to represent items. However, existing approaches that tokenize each modality independently face two critical limitations: (1) redundancy across modalities that reduces efficiency, and (2) failure to capture inter-modal interactions that limits item representation. We introduce FusID, a modality-fused semantic ID framework that addresses these limitations through three key components: (i) multimodal fusion that learns unified representations by jointly encoding information across modalities, (ii) representation learning that brings frequently co-occurring item embeddings closer while maintaining distinctiveness and preventing feature redundancy, and (iii) product quantization that converts the fused continuous embeddings into multiple discrete tokens to mitigate ID conflict. Evaluated on a multimodal next-song recommendation (i.e., playlist continuation) benchmark, FusID achieves zero ID conflicts, ensuring that each token sequence maps to exactly one song, mitigates codebook underutilization, and outperforms baselines in terms of MRR and Recall@k (k = 1, 5, 10, 20).
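As a concrete illustration of component (iii), here is a minimal product-quantization sketch: the fused embedding is split into sub-vectors, each mapped to its nearest codeword, so a song is represented by several discrete tokens. Codebook sizes and dimensions are toy values, not FusID's.

```python
def product_quantize(embedding, codebooks):
    """Split the fused embedding into equal sub-vectors and map each
    to the index of its nearest codeword (squared Euclidean distance),
    yielding one discrete token per sub-space."""
    m = len(codebooks)
    d = len(embedding) // m
    tokens = []
    for i, book in enumerate(codebooks):
        sub = embedding[i * d:(i + 1) * d]
        dists = [sum((s - c) ** 2 for s, c in zip(sub, word)) for word in book]
        tokens.append(dists.index(min(dists)))
    return tokens

# toy example: 2 sub-spaces, 3 codewords each, 2-dim sub-vectors
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 1.0], [1.0, 0.0], [5.0, 5.0]],
]
print(product_quantize([0.9, 1.1, 0.1, 0.9], codebooks))  # [1, 0]
```

Because each item gets multiple tokens, two items only collide if they agree in every sub-space, which is how product quantization mitigates ID conflict.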
https://arxiv.org/abs/2601.08764
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@k) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
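The cluster-size reweighting can be sketched as follows. The exact inverse-proportional form is an assumption (the abstract only states that advantages are reweighted inversely with cluster size), and the cluster IDs stand in for the LLM judge's strategy clusters.

```python
from collections import Counter

def uniqueness_weighted_advantages(advantages, cluster_ids):
    """Reweight rollout-level advantages inversely with the size of
    the strategy cluster each rollout belongs to, so a correct but
    rare strategy earns more than a redundant one."""
    sizes = Counter(cluster_ids)
    return [a / sizes[c] for a, c in zip(advantages, cluster_ids)]

# 4 correct rollouts: three share strategy "A", one uses rare strategy "B"
adv = [1.0, 1.0, 1.0, 1.0]
print(uniqueness_weighted_advantages(adv, ["A", "A", "A", "B"]))
# rare strategy B keeps its full advantage; A's is split three ways
```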
https://arxiv.org/abs/2601.08763
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. Such an opaque process offers no reliable basis for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce M3CoTBench, a new benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 tasks of varying difficulty, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at this https URL.
https://arxiv.org/abs/2601.08758
The rapid growth of urban populations and the increasing need for sustainable transportation solutions have prompted a shift towards electric buses in public transit systems. However, the effective management of mixed fleets consisting of both electric and diesel buses poses significant operational challenges. One major challenge is coping with dynamic electricity pricing, where charging costs vary throughout the day. Transit agencies must optimize charging assignments in response to such dynamism while accounting for secondary considerations such as seating constraints. This paper presents a comprehensive mixed-integer linear programming (MILP) model to address these challenges by jointly optimizing charging schedules and trip assignments for mixed (electric and diesel bus) fleets while considering factors such as dynamic electricity pricing, vehicle capacity, and route constraints. We address the potential computational intractability of the MILP formulation, which can arise even with relatively small fleets, by employing a hierarchical approach tailored to the fleet composition. By using real-world data from the city of Chattanooga, Tennessee, USA, we show that our approach can result in significant savings in the operating costs of the mixed transit fleets.
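To make the joint optimization concrete, here is a toy stand-in for the charging-schedule part of the MILP: each electric bus is assigned one charging hour under dynamic prices and a per-hour charger limit, solved by brute force rather than a MILP solver. All numbers and the single-hour-per-bus simplification are illustrative, not the paper's formulation.

```python
from itertools import product

def cheapest_charging_plan(energy_needs_kwh, hourly_price, chargers=1):
    """Exhaustively assign each electric bus one charging hour,
    minimizing total cost under a charger-count limit per hour
    (a toy stand-in for the MILP's charging-schedule variables)."""
    hours = range(len(hourly_price))
    best_cost, best_plan = float("inf"), None
    for plan in product(hours, repeat=len(energy_needs_kwh)):
        if any(plan.count(h) > chargers for h in set(plan)):
            continue  # too many buses charging in the same hour
        cost = sum(e * hourly_price[h] for e, h in zip(energy_needs_kwh, plan))
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_plan, best_cost

prices = [0.30, 0.12, 0.10, 0.25]  # $/kWh across four hours
plan, cost = cheapest_charging_plan([100, 80], prices, chargers=1)
print(plan)  # (2, 1): the bus needing more energy takes the cheapest hour
```

The real model additionally couples these variables with trip assignments, vehicle capacity, and route constraints, which is what makes the full MILP hard and motivates the hierarchical decomposition.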
https://arxiv.org/abs/2601.08753
Recent developments in natural language processing highlight text as an emerging data source for ecology. Textual resources carry unique information that can complement geospatial data sources, providing local-scale insights into environmental conditions and properties hidden from more traditional data sources. Leveraging textual information in a spatial context presents several challenges. First, the contribution of textual data remains poorly defined in an ecological context, and it is unclear for which tasks it should be incorporated. Second, unlike ubiquitous satellite imagery or environmental covariates, the availability of textual data is sparse and irregular, so its integration with geospatial data is not straightforward. In response to these challenges, this work proposes an attention-based approach that combines aerial imagery and geolocated text within a spatial neighbourhood, i.e. integrating contributions from several nearby observations. Our approach combines vision and text representations with a geolocation encoding, using an attention-based module that dynamically selects spatial neighbours that are useful for prediction. The proposed approach is applied to the EcoWikiRS dataset, which combines high-resolution aerial imagery with sentences extracted from Wikipedia describing local environmental conditions across Switzerland. Our model is evaluated on the task of predicting 103 environmental variables from the SWECO25 data cube. Our approach consistently outperforms single-location or unimodal (image-only or text-only) baselines. When analysing variables by thematic groups, results show a significant improvement in performance for climatic, edaphic, population, and land use/land cover variables, underscoring the benefit of including the spatial context when combining text and image data.
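A minimal sketch of the neighbourhood-attention idea, assuming dot-product similarity minus a Euclidean distance penalty as the attention score; the paper's actual module learns its scoring from vision, text, and geolocation encodings, so this is only an illustration of the mechanism.

```python
import math

def neighbourhood_attention(query, neighbour_feats, neighbour_coords, site_coord):
    """Softmax-attention pooling over nearby observations: scores mix
    feature similarity (dot product) with a distance penalty, so the
    model can dynamically up-weight useful spatial neighbours."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    scores = [
        sum(q * f for q, f in zip(query, feats)) - dist(c, site_coord)
        for feats, c in zip(neighbour_feats, neighbour_coords)
    ]
    m = max(scores)  # stabilized softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # weighted combination of neighbour features
    pooled = [sum(w * f[i] for w, f in zip(weights, neighbour_feats))
              for i in range(len(query))]
    return weights, pooled

w, pooled = neighbourhood_attention(
    query=[1.0, 0.0],
    neighbour_feats=[[1.0, 0.0], [0.0, 1.0]],
    neighbour_coords=[(0.0, 0.1), (5.0, 5.0)],
    site_coord=(0.0, 0.0),
)
print([round(x, 3) for x in w])  # the nearby, similar neighbour dominates
```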
https://arxiv.org/abs/2601.08750
We propose a novel piecewise smooth image model with piecewise constant local parameters that are automatically adapted to each image. Technically, the model is formulated in terms of factor graphs with NUP (normal with unknown parameters) priors, and the pertinent computations amount to iterations of conjugate-gradient steps and Gaussian message passing. The proposed model and algorithms are demonstrated with applications to denoising and contrast enhancement.
https://arxiv.org/abs/2601.08749
Current context augmentation methods, such as retrieval-augmented generation, are essential for solving knowledge-intensive reasoning tasks. However, they typically adhere to a rigid, brute-force strategy that executes retrieval at every step. This indiscriminate approach not only incurs unnecessary computational costs but also degrades performance by saturating the context with irrelevant noise. To address these limitations, we introduce Agentic Context Evolution (ACE), a framework inspired by human metacognition that dynamically determines whether to seek new evidence or reason with existing knowledge. ACE employs a central orchestrator agent to make decisions strategically via majority voting, alternating between activating a retriever agent for external retrieval and a reasoner agent for internal analysis and refinement. By eliminating redundant retrieval steps, ACE maintains a concise and evolved context. Extensive experiments on challenging multi-hop QA benchmarks demonstrate that ACE significantly outperforms competitive baselines in accuracy while achieving efficient token usage. This work provides valuable insights into advancing context-evolved generation for complex, knowledge-intensive tasks.
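A toy rendering of the orchestration loop: a panel of votes decides by majority whether to retrieve new evidence or let the reasoner condense the context. The voters, retriever, and reasoner below are trivial stand-ins for LLM calls; only the majority-vote retrieve-or-reason decision is taken from the abstract.

```python
from collections import Counter

def orchestrate(question, context, voters, retriever, reasoner):
    """One ACE-style step: majority vote among orchestrator calls decides
    whether to fetch new evidence or refine the existing context."""
    votes = Counter(v(question, context) for v in voters)
    action, _ = votes.most_common(1)[0]
    if action == "retrieve":
        return context + [retriever(question, context)]
    return [reasoner(question, context)]  # refined, concise context

# toy components (stand-ins for LLM calls)
voters = [lambda q, c: "retrieve" if len(c) < 2 else "reason"] * 3
retriever = lambda q, c: f"evidence-{len(c)}"
reasoner = lambda q, c: "summary(" + ", ".join(c) + ")"

ctx = []
for _ in range(3):
    ctx = orchestrate("who founded X?", ctx, voters, retriever, reasoner)
print(ctx)  # two retrieval steps, then one consolidating reasoning step
```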
https://arxiv.org/abs/2601.08747
In Text-to-SQL tasks, existing LLM-based methods often include extensive database schemas in prompts, leading to long context lengths and increased prefilling latency. While user queries typically focus on recurrent table sets (offering an opportunity for KV cache sharing across queries), current inference engines, such as SGLang and vLLM, generate redundant prefix cache copies when processing user queries with varying table orders. To address this inefficiency, we propose precomputing table representations as KV caches offline and querying the required ones online. A key aspect of our approach is the computation of table caches while preserving primary-foreign key relationships between tables. Additionally, we construct a Table Trie structure to facilitate efficient KV cache lookups during inference. To enhance cache performance, we introduce a cache management system with a query reranking strategy to improve cache hit rates and a computation-loading pipeline for parallelizing model inference and cache loading. Experimental results show that our proposed TableCache achieves up to a 3.62x speedup in Time to First Token (TTFT) with negligible performance degradation.
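The Table Trie can be pictured as a token-keyed trie whose terminal nodes hold references to precomputed KV caches, so a lookup is a single walk independent of the table order in the query. The tokenization by name parts and the string cache payloads below are assumptions for illustration, not TableCache's actual structures.

```python
class TableTrie:
    """Trie keyed on table-name tokens; each terminal node stores a
    reference to that table's precomputed KV cache, making lookups
    during prefill a single downward walk."""

    def __init__(self):
        self.children, self.cache = {}, None

    def insert(self, name_tokens, kv_cache):
        node = self
        for tok in name_tokens:
            node = node.children.setdefault(tok, TableTrie())
        node.cache = kv_cache

    def lookup(self, name_tokens):
        node = self
        for tok in name_tokens:
            node = node.children.get(tok)
            if node is None:
                return None  # cache miss: fall back to computing the prefix
        return node.cache

trie = TableTrie()
trie.insert(["orders"], "kv<orders>")               # offline precomputation
trie.insert(["order", "items"], "kv<order_items>")
print(trie.lookup(["order", "items"]))  # kv<order_items>
print(trie.lookup(["users"]))           # None (miss)
```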
https://arxiv.org/abs/2601.08743
Attributional inference, the ability to predict latent intentions behind observed actions, is a critical yet underexplored capability for large language models (LLMs) operating in multi-agent environments. Traditional natural language inference (NLI), in fact, fails to capture the nuanced, intention-driven reasoning essential for complex interactive systems. To address this gap, we introduce Attributional NLI (Att-NLI), a framework that extends NLI with principles from social psychology to assess an agent's capacity for abductive intentional inference (generating hypotheses about latent intentions), and subsequent deductive verification (drawing valid logical conclusions). We instantiate Att-NLI via a textual game, Undercover-V, experimenting with three types of LLM agents with varying reasoning capabilities and access to external tools: a standard NLI agent using only deductive inference, an Att-NLI agent employing abductive-deductive inference, and a neuro-symbolic Att-NLI agent performing abductive-deductive inference with external theorem provers. Extensive experiments demonstrate a clear hierarchy of attributional inference capabilities, with neuro-symbolic agents consistently outperforming others, achieving an average win rate of 17.08%. Our results underscore the role that Att-NLI can play in developing agents with sophisticated reasoning capabilities, highlighting, at the same time, the potential impact of neuro-symbolic AI in building rational LLM agents acting in multi-agent environments.
https://arxiv.org/abs/2601.08742
Large Language Models (LLMs) struggle to reason over large-scale enterprise spreadsheets containing thousands of numeric rows, multiple linked sheets, and embedded visual content such as charts and receipts. Prior state-of-the-art spreadsheet reasoning approaches typically rely on single-sheet compression or full-context encoding, which limits scalability and fails to reflect how real users interact with complex, multimodal workbooks. We introduce FRTR-Bench, the first large-scale benchmark for multimodal spreadsheet reasoning, comprising 30 enterprise-grade Excel workbooks spanning nearly four million cells and more than 50 embedded images. To address these challenges, we present From Rows to Reasoning (FRTR), an advanced, multimodal retrieval-augmented generation framework that decomposes Excel workbooks into granular row, column, and block embeddings, employs hybrid lexical-dense retrieval with Reciprocal Rank Fusion (RRF), and integrates multimodal embeddings to reason over both numerical and visual information. We tested FRTR on six LLMs, achieving 74% answer accuracy on FRTR-Bench with Claude Sonnet 4.5, a substantial improvement over prior state-of-the-art approaches that reached only 24%. On the SpreadsheetLLM benchmark, FRTR achieved 87% accuracy with GPT-5 while reducing token usage by roughly 50% compared to context-compression methods.
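Reciprocal Rank Fusion itself is a standard formula, score(d) = Σ_r 1/(k + rank_r(d)) over the rankers r; a self-contained sketch with the conventional k = 60 follows (the row identifiers and the two toy rankings are made up, and FRTR's hybrid retrieval layers much more on top of this).

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists (e.g. a lexical and a dense retriever's
    results) by summing 1 / (k + rank) per document; documents ranked
    well by several rankers rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["row_17", "row_3", "row_42"]   # e.g. BM25 over cell text
dense = ["row_3", "row_42", "row_17"]     # e.g. embedding similarity
print(reciprocal_rank_fusion([lexical, dense]))
# ['row_3', 'row_17', 'row_42']
```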
https://arxiv.org/abs/2601.08741
Knowledge graphs (KGs) provide structured evidence that can ground large language model (LLM) reasoning for knowledge-intensive question answering. However, many practical KGs are private, and sending retrieved triples or exploration traces to closed-source LLM APIs introduces leakage risk. Existing privacy treatments focus on masking entity names, but they still face four limitations: structural leakage under semantic masking, uncontrollable remote interaction, fragile multi-hop and multi-entity reasoning, and limited experience reuse for stability and efficiency. To address these issues, we propose PrivGemo, a privacy-preserving retrieval-augmented framework for KG-grounded reasoning with memory-guided exposure control. PrivGemo uses a dual-tower design to keep raw KG knowledge local while enabling remote reasoning over an anonymized view that goes beyond name masking to limit both semantic and structural exposure. PrivGemo supports multi-hop, multi-entity reasoning by retrieving anonymized long-hop paths that connect all topic entities, while keeping grounding and verification on the local KG. A hierarchical controller and a privacy-aware experience memory further reduce unnecessary exploration and remote interactions. Comprehensive experiments on six benchmarks show that PrivGemo achieves overall state-of-the-art results, outperforming the strongest baseline by up to 17.1%. Furthermore, PrivGemo enables smaller models (e.g., Qwen3-4B) to achieve reasoning performance comparable to that of GPT-4-Turbo.
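One way to picture the anonymized view: replace names with salted pseudonyms before anything leaves the local side, so the remote LLM sees neither raw names nor easily invertible identifiers, while consistent pseudonyms preserve graph connectivity for reasoning. This is a simplified sketch with an assumed hashing scheme; PrivGemo's actual view goes further and also limits structural exposure.

```python
import hashlib

def anonymize_triples(triples, salt="local-secret"):
    """Map entity and relation names to salted pseudonyms. The same
    name always maps to the same pseudonym, so the anonymized view
    keeps the graph's connectivity without revealing raw names."""
    def pseudo(name, prefix):
        digest = hashlib.sha256((salt + name).encode()).hexdigest()[:8]
        return f"{prefix}_{digest}"

    return [(pseudo(h, "E"), pseudo(r, "R"), pseudo(t, "E"))
            for h, r, t in triples]

kg = [("Alice", "works_at", "AcmeCorp"), ("AcmeCorp", "located_in", "Paris")]
view = anonymize_triples(kg)
print(view)  # same graph shape, no raw names leave the local side
```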
https://arxiv.org/abs/2601.08739
Automating Infrastructure-as-Code (IaC) is challenging, and large language models (LLMs) often produce incorrect configurations from natural language (NL). We present TerraFormer, a neuro-symbolic framework for IaC generation and mutation that combines supervised fine-tuning with verifier-guided reinforcement learning, using formal verification tools to provide feedback on syntax, deployability, and policy compliance. We curate two large, high-quality NL-to-IaC datasets, TF-Gen (152k instances) and TF-Mutn (52k instances), via multi-stage verification and iterative LLM self-correction. Evaluations against 17 state-of-the-art LLMs, including ~50x larger models like Sonnet 3.7, DeepSeek-R1, and GPT-4.1, show that TerraFormer improves correctness over its base LLM by 15.94% on IaC-Eval, 11.65% on TF-Gen (Test), and 19.60% on TF-Mutn (Test). It outperforms larger models on both TF-Gen (Test) and TF-Mutn (Test), ranks third on IaC-Eval, and achieves top best-practices and security compliance.
https://arxiv.org/abs/2601.08734
Accurate delineation of acute ischemic stroke lesions in MRI is a key component of stroke diagnosis and management. In recent years, deep learning models have been successfully applied to the automatic segmentation of such lesions. While most proposed architectures are based on the U-Net framework, they primarily differ in their choice of loss functions and in the use of deep supervision, residual connections, and attention mechanisms. Moreover, many implementations are not publicly available, and the optimal configuration for acute ischemic stroke (AIS) lesion segmentation remains unclear. In this work, we introduce ISLA (Ischemic Stroke Lesion Analyzer), a new deep learning model for AIS lesion segmentation from diffusion MRI, trained on three multicenter databases totaling more than 1500 AIS participants. Through systematic optimization of the loss function, convolutional architecture, deep supervision, and attention mechanisms, we developed a robust segmentation framework. We further investigated unsupervised domain adaptation to improve generalization to an external clinical dataset. ISLA outperformed two state-of-the-art approaches for AIS lesion segmentation on an external test set. Codes and trained models will be made publicly available to facilitate reuse and reproducibility.
https://arxiv.org/abs/2601.08732
Despite its promise, imitation learning often fails in long-horizon environments where perfect replication of demonstrations is unrealistic and small errors can accumulate catastrophically. We introduce Cago (Capability-Aware Goal Sampling), a novel learning-from-demonstrations method that mitigates the brittle dependence on expert trajectories for direct imitation. Unlike prior methods that rely on demonstrations only for policy initialization or reward shaping, Cago dynamically tracks the agent's competence along expert trajectories and uses this signal to select intermediate goals, just beyond the agent's current reach, to guide learning. This results in an adaptive curriculum that enables steady progress toward solving the full task. Empirical results demonstrate that Cago significantly improves sample efficiency and final performance across a range of sparse-reward, goal-conditioned tasks, consistently outperforming existing learning-from-demonstrations baselines.
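One plausible reading of the goal-selection rule, sketched below: walk the expert trajectory and pick the first state whose estimated success rate drops below a competence threshold, i.e. the nearest goal the agent cannot yet reach reliably. The threshold and the per-state success-rate estimates are assumptions for illustration, not Cago's exact mechanism.

```python
def sample_goal(expert_trajectory, success_rate, threshold=0.5):
    """Return the first expert state whose estimated success rate falls
    below the competence threshold: a goal just beyond current reach."""
    for state, rate in zip(expert_trajectory, success_rate):
        if rate < threshold:
            return state
    return expert_trajectory[-1]  # competent everywhere: use the final goal

traj = ["s0", "s1", "s2", "s3"]
print(sample_goal(traj, [0.9, 0.8, 0.4, 0.1]))  # s2
print(sample_goal(traj, [0.9, 0.9, 0.9, 0.9]))  # s3
```

As the agent's success rates rise, the sampled goal advances along the trajectory, producing the adaptive curriculum described above.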
https://arxiv.org/abs/2601.08731
Scene Graph Generation (SGG) suffers from a long-tailed distribution, where a few predicate classes dominate while many others are underrepresented, leading to biased models that underperform on rare relations. Unbiased-SGG methods address this issue by implementing debiasing strategies, but often at the cost of spatial understanding, resulting in an over-reliance on semantic priors. We introduce Salience-SGG, a novel framework featuring an Iterative Salience Decoder (ISD) that emphasizes triplets with salient spatial structures. To support this, we propose semantic-agnostic salience labels to guide the ISD. Evaluations on Visual Genome, Open Images V6, and GQA-200 show that Salience-SGG achieves state-of-the-art performance and improves the spatial understanding of existing Unbiased-SGG methods, as demonstrated by the Pairwise Localization Average Precision metric.
https://arxiv.org/abs/2601.08728
A central belief in scaling reinforcement learning with verifiable rewards for instruction following (IF) tasks is that a diverse mixture of verifiable hard and unverifiable soft constraints is essential for generalizing to unseen instructions. In this work, we challenge this prevailing consensus through a systematic empirical investigation. Counter-intuitively, we find that models trained on hard-only constraints consistently outperform those trained on mixed datasets. Extensive experiments reveal that reward precision, rather than constraint diversity, is the primary driver of effective alignment. The LLM judge suffers from a low recall rate in detecting false responses, which leads to severe reward hacking, thereby undermining the benefits of diversity. Furthermore, analysis of the attention mechanism reveals that high-precision rewards develop a transferable meta-skill for IF. Motivated by these insights, we propose a simple yet effective data-centric refinement strategy that prioritizes reward precision. Evaluated on five benchmarks, our approach outperforms competitive baselines by 13.4% in performance while achieving a 58% reduction in training time, maintaining strong generalization beyond instruction following. Our findings advocate for a paradigm shift: moving away from the indiscriminate pursuit of data diversity toward high-precision rewards.
https://arxiv.org/abs/2601.04954
Localization is a fundamental capability for autonomous robots, enabling them to operate effectively in dynamic environments. In Robocon 2025, accurate and reliable localization is crucial for improving shooting precision, avoiding collisions with other robots, and navigating the competition field efficiently. In this paper, we propose a hybrid localization algorithm that integrates classical techniques with learning-based methods, relying solely on visual data from the court's floor to achieve self-localization on the basketball field.
https://arxiv.org/abs/2601.08713
The incorporation of advanced control algorithms into prosthetic hands significantly enhances their ability to replicate the intricate motions of a human hand. This work introduces a model-based controller that combines an Artificial Neural Network (ANN) approach with a Sliding Mode Controller (SMC) designed for a tendon-driven soft continuum wrist integrated into a prosthetic hand known as "PRISMA HAND II". Our research focuses on developing a controller that provides a fast dynamic response with reduced computational effort during wrist motions. The proposed controller consists of an ANN for computing bending angles together with an SMC to regulate tendon forces. Kinematic and dynamic models of the wrist are formulated using the Piece-wise Constant Curvature (PCC) hypothesis. The performance of the proposed controller is compared with other control strategies developed for the same wrist. Simulation studies and experimental validations of the fabricated wrist using the controller are included in the paper.
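For readers unfamiliar with sliding mode control, a generic first-order sliding mode law on a toy double-integrator plant is sketched below: s = de/dt + lam*e defines the sliding surface on the bending-angle error e, and tanh(s/eps) replaces sign(s) to limit chattering. The gains, the smoothing, and the plant are illustrative and unrelated to PRISMA HAND II's identified model, where the ANN supplies the bending angles and the SMC regulates tendon forces.

```python
import math

def smc_control(angle, rate, target, lam=1.0, k=4.0, eps=0.05):
    """Classic first-order sliding mode law on the tracking error."""
    e = target - angle
    de = -rate  # constant target, so the error rate is -angle rate
    s = de + lam * e  # sliding surface
    return k * math.tanh(s / eps)  # smoothed switching term

# toy double-integrator plant: commanded effort sets angular acceleration
angle, rate, dt = 0.0, 0.0, 0.01
for _ in range(2000):
    u = smc_control(angle, rate, target=0.6)
    rate += u * dt
    angle += rate * dt
print(round(angle, 2))  # settles near the 0.6 rad target
```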
https://arxiv.org/abs/2601.08711
Explainable artificial intelligence (XAI) is concerned with producing explanations indicating the inner workings of models. For a Rashomon set of similarly performing models, explanations provide a way of disambiguating the behavior of individual models, helping select models for deployment. However, explanations themselves can vary depending on the explainer used, and need to be evaluated. In the paper "Evaluating Model Explanations without Ground Truth", we proposed three principles of explanation evaluation and a new method "AXE" to evaluate the quality of feature-importance explanations. We go on to illustrate how evaluation metrics that rely on comparing model explanations against ideal ground truth explanations obscure behavioral differences within a Rashomon set. Explanation evaluation aligned with our proposed principles would highlight these differences instead, helping select models from the Rashomon set. The selection of alternate models from the Rashomon set can maintain identical predictions but mislead explainers into generating false explanations, and mislead evaluation methods into considering the false explanations to be of high quality. AXE, our proposed explanation evaluation method, can detect this adversarial fairwashing of explanations with a 100% success rate. Unlike prior explanation evaluation strategies such as those based on model sensitivity or ground truth comparison, AXE can determine when protected attributes are used to make predictions.
https://arxiv.org/abs/2601.08703
Accurate and generalisable segmentation of stroke lesions from magnetic resonance imaging (MRI) is essential for advancing clinical research, prognostic modelling, and personalised interventions. Although deep learning has improved automated lesion delineation, many existing models are optimised for narrow imaging contexts and generalise poorly to independent datasets, modalities, and stroke stages. Here, we systematically evaluated stroke lesion segmentation using the nnU-Net framework across multiple heterogeneous, publicly available MRI datasets spanning acute and chronic stroke. Models were trained and tested on diffusion-weighted imaging (DWI), fluid-attenuated inversion recovery (FLAIR), and T1-weighted MRI, and evaluated on independent datasets. Across stroke stages, models showed robust generalisation, with segmentation accuracy approaching reported inter-rater reliability. Performance varied with imaging modality and training data characteristics. In acute stroke, DWI-trained models consistently outperformed FLAIR-based models, with only modest gains from multimodal combinations. In chronic stroke, increasing training set size improved performance, with diminishing returns beyond several hundred cases. Lesion volume was a key determinant of accuracy: smaller lesions were harder to segment, and models trained on restricted volume ranges generalised poorly. MRI image quality further constrained generalisability: models trained on lower-quality scans transferred poorly, whereas those trained on higher-quality data generalised well to noisier images. Discrepancies between predictions and reference masks were often attributable to limitations in manual annotations. Together, these findings show that automated lesion segmentation can approach human-level performance while identifying key factors governing generalisability and informing the development of lesion segmentation tools.
https://arxiv.org/abs/2601.08701