Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective that aligns manipulation trajectories and transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3\% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
https://arxiv.org/abs/2601.09708
Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.
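A value-conditioned prefix embedding can be sketched in a few lines. The encoding below (sign, log-magnitude, and sinusoidal features of the log-magnitude) is a hypothetical stand-in for the paper's conditioning scheme, meant only to show how magnitude information can be injected ahead of the ordinary digit tokens:

```python
import math

def value_embedding(value: float, dim: int = 8) -> list:
    """Map a number to a dense vector that encodes its magnitude.

    Hypothetical encoding (not the paper's exact scheme): sign,
    log-magnitude, and sinusoidal features of the log-magnitude,
    so numbers of nearby magnitude get nearby embeddings.
    """
    if value == 0.0:
        return [0.0] * dim
    sign = 1.0 if value > 0 else -1.0
    logmag = math.log10(abs(value))
    feats = [sign, logmag]
    # Transformer-style sinusoidal features of the log-magnitude.
    for i in range((dim - 2) // 2):
        freq = 1.0 / (10 ** i)
        feats.append(math.sin(logmag * freq))
        feats.append(math.cos(logmag * freq))
    return feats[:dim]

def tokenize_with_value_prefix(num_text: str):
    """Prepend a value-aware prefix 'token' to the ordinary digit tokens."""
    prefix = ("<NUM>", value_embedding(float(num_text)))
    digit_tokens = list(num_text)
    return [prefix] + digit_tokens
```

The prefix rides along with the standard tokenization, so the tokenizer and decoder-only architecture stay unchanged, consistent with the compatibility claim above.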
https://arxiv.org/abs/2601.09706
Code generation tasks aim to automate the conversion of user requirements into executable code, significantly reducing manual development effort and enhancing software productivity. The emergence of large language models (LLMs) has significantly advanced code generation, though their efficiency is still limited by certain inherent architectural constraints. Each token generation necessitates a complete inference pass, requiring persistent retention of contextual information in memory and escalating resource consumption. While existing research prioritizes inference-phase optimizations such as prompt compression and model quantization, the generation phase remains underexplored. To tackle these challenges, we propose a knowledge-infused framework named ShortCoder, which optimizes code generation efficiency while preserving semantic equivalence and readability. In particular, we introduce: (1) ten syntax-level simplification rules for Python, derived from AST-preserving transformations, achieving 18.1% token reduction without functional compromise; (2) a hybrid data synthesis pipeline integrating rule-based rewriting with LLM-guided refinement, producing ShorterCodeBench, a corpus of validated, semantically consistent pairs of original and simplified code; (3) a fine-tuning strategy that injects conciseness awareness into base LLMs. Extensive experimental results demonstrate that ShortCoder consistently outperforms state-of-the-art methods on HumanEval, improving generation efficiency by 18.1%-37.8% over previous methods while preserving code-generation performance.
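The abstract does not list the ten rules, but a rule in the same spirit can be written as a short AST transformation. The rule below (rewriting `expr == True` as `expr`) is an illustrative assumption, not one of ShortCoder's actual rules; as the comment notes, it preserves behavior when the expression is boolean, which is the kind of guarantee syntax-level rules rely on:

```python
import ast

class DropRedundantTrueComparison(ast.NodeTransformer):
    """Illustrative simplification rule (hypothetical, in the spirit of
    ShortCoder's syntax-level rules): rewrite `expr == True` as `expr`,
    which preserves behavior for boolean expressions while saving tokens.
    """
    def visit_Compare(self, node: ast.Compare):
        self.generic_visit(node)
        if (len(node.ops) == 1
                and isinstance(node.ops[0], ast.Eq)
                and isinstance(node.comparators[0], ast.Constant)
                and node.comparators[0].value is True):
            return node.left
        return node

def simplify(source: str) -> str:
    """Apply the rule and unparse back to (shorter) source code."""
    tree = DropRedundantTrueComparison().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))
```

For example, `simplify("if ok == True:\n    print('yes')")` yields `if ok:` followed by the unchanged body, saving two tokens per occurrence.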
https://arxiv.org/abs/2601.09703
Segment Anything 3 (SAM3) has established a powerful foundation that robustly detects, segments, and tracks specified targets in videos. However, in its original implementation, group-level collective memory selection is suboptimal for complex multi-object scenarios: it employs a synchronized decision across all concurrent targets conditioned on their average performance, often overlooking individual reliability. To this end, we propose SAM3-DMS, a training-free decoupled strategy that performs fine-grained memory selection for each individual object. Experiments demonstrate that our approach achieves robust identity preservation and tracking stability. Notably, our advantage becomes more pronounced with increased target density, establishing a solid foundation for simultaneous multi-target video segmentation in the wild.
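The difference between the synchronized baseline and the decoupled strategy can be sketched with plain confidence scores. This is a schematic reconstruction, not SAM3's actual memory interface: `scores_per_object[i][f]` is a hypothetical per-object tracking confidence for candidate memory frame `f`:

```python
def synchronized_select(scores_per_object, k):
    """Group-level selection (baseline sketch): rank candidate memory
    frames by the *average* confidence across objects and keep the same
    top-k frames for every object."""
    n_frames = len(scores_per_object[0])
    avg = [sum(obj[f] for obj in scores_per_object) / len(scores_per_object)
           for f in range(n_frames)]
    keep = sorted(range(n_frames), key=lambda f: avg[f], reverse=True)[:k]
    return [sorted(keep) for _ in scores_per_object]

def decoupled_select(scores_per_object, k):
    """Decoupled selection (SAM3-DMS-style sketch): each object keeps the
    frames where *it* was tracked most reliably, independent of the group."""
    return [sorted(sorted(range(len(obj)), key=lambda f: obj[f],
                          reverse=True)[:k])
            for obj in scores_per_object]
```

With two objects that are reliable in disjoint frames, the synchronized rule forces a shared compromise set, while the decoupled rule gives each object its own reliable memories, which is why the gap grows with target density.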
https://arxiv.org/abs/2601.09699
3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline, first detecting 2D keypoints in each view and then associating these detections across views to triangulate the 3D pose. Existing methods rely on mere pairwise associations to model this correspondence problem, treating global consistency between views (i.e., cycle consistency) as a soft constraint. Yet, reconciling these constraints for multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than through pairwise association. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy to substantially reduce the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.
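The cycle-consistency constraint that pairwise methods treat only softly is easy to state concretely. The sketch below checks it for three views, with matches represented as plain dicts, a simplification of real keypoint association; COMPOSE enforces this globally through hypergraph partitioning rather than checking it after the fact:

```python
def is_cycle_consistent(match_ab, match_bc, match_ca):
    """Check cycle consistency across three views: composing the matches
    A->B->C->A must map every keypoint back to itself. Each match is a
    dict from keypoint id in one view to keypoint id in the next."""
    for a, b in match_ab.items():
        c = match_bc.get(b)
        if c is None or match_ca.get(c) != a:
            return False  # a spurious association breaks the cycle
    return True
```

A single spurious pairwise association anywhere in the cycle makes the composition fail, which is exactly how errors propagate when consistency is only a soft constraint.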
https://arxiv.org/abs/2601.09698
Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
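SRENDER's keyframe budget comes from a learned predictor; the heuristic below is a hand-written stand-in (the constants and the length/turning features are made up) that shows the intended behavior: few keyframes for simple trajectories, more for complex camera motion:

```python
import math

def keyframe_budget(path, base=4, per_unit=1.0, per_radian=2.0, cap=32):
    """Hypothetical stand-in for SRENDER's learned keyframe predictor:
    allocate more keyframes to longer and more sharply turning camera
    paths, given as a list of (x, y) positions."""
    length = sum(math.hypot(path[i][0] - path[i - 1][0],
                            path[i][1] - path[i - 1][1])
                 for i in range(1, len(path)))
    headings = [math.atan2(path[i][1] - path[i - 1][1],
                           path[i][0] - path[i - 1][0])
                for i in range(1, len(path))]
    turning = 0.0
    for h0, h1 in zip(headings, headings[1:]):
        d = abs(h1 - h0)
        turning += min(d, 2 * math.pi - d)  # wrap the angle difference
    return min(cap, max(2, round(base + per_unit * length
                                 + per_radian * turning)))
```

A straight dolly move gets a small budget, while a path with a sharp turn earns extra keyframes, so rendering cost stays amortized over the hundreds of in-between frames.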
https://arxiv.org/abs/2601.09697
LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and supports empathetic communication in asynchronous healthcare.
https://arxiv.org/abs/2601.09696
As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
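The sensitivity profile the agent consumes can be sketched with plain lists. The Wanda-style weight-activation metric and the z-score normalization follow the description above; combining the two signals by simply summing their z-scores is an assumption made for illustration, not the paper's exact aggregation rule:

```python
import statistics

def wanda_score(weights, act_norms):
    """Wanda-style saliency for one layer: mean of |w_ij| * ||x_j||,
    where act_norms[j] is the norm of input activation channel j."""
    total, count = 0.0, 0
    for row in weights:
        for w, xn in zip(row, act_norms):
            total += abs(w) * xn
            count += 1
    return total / count

def zscores(values):
    """Normalize per-layer statistics as z-scores so an LLM agent can
    compare layers in a model-agnostic way."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values) or 1.0  # guard constant inputs
    return [(v - mu) / sd for v in values]

def sensitivity_profile(layer_wanda, layer_grad):
    """Combine weight-activation saliency with gradient importance into
    one per-layer profile (summing z-scores is an illustrative choice)."""
    zw, zg = zscores(layer_wanda), zscores(layer_grad)
    return [w + g for w, g in zip(zw, zg)]
```

The resulting profile is what gets serialized into the agent's prompt, so the agent reasons over normalized, comparable numbers rather than raw per-model magnitudes.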
https://arxiv.org/abs/2601.09694
Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
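Consensus voting can estimate correctness without any ground-truth labels. A minimal sketch, assuming each model answers the same list of queries and the per-query majority answer serves as a pseudo-label (CASCAL additionally clusters models into skill niches, which this sketch omits):

```python
from collections import Counter

def consensus_correctness(answers_by_model):
    """Label-free correctness estimate: score each model by how often it
    agrees with the per-query majority vote across the model pool.

    answers_by_model: dict model_name -> list of answers, one per query.
    Returns dict model_name -> estimated accuracy in [0, 1].
    """
    models = list(answers_by_model)
    n_queries = len(answers_by_model[models[0]])
    scores = {m: 0 for m in models}
    for q in range(n_queries):
        votes = Counter(answers_by_model[m][q] for m in models)
        majority, _ = votes.most_common(1)[0]
        for m in models:
            if answers_by_model[m][q] == majority:
                scores[m] += 1
    return {m: scores[m] / n_queries for m in models}
```

Because the estimate needs only the models' own answers, it stays usable when the training queries themselves are synthetic, which is exactly the RGD setting.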
https://arxiv.org/abs/2601.09692
Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking module that autonomously extracts and verifies report statements via web search, even when citations are missing.
https://arxiv.org/abs/2601.09688
Multi-Task Learning (MTL) combined with Low-Rank Adaptation (LoRA) has emerged as a promising direction for parameter-efficient deployment of Large Language Models (LLMs). By sharing a single adapter across multiple tasks, one can significantly reduce storage overhead. However, this approach suffers from negative transfer, where conflicting gradient updates from distinct tasks degrade the performance of individual tasks compared to single-task fine-tuning. This problem is exacerbated in LoRA due to the low-rank constraint, which limits the optimization landscape's capacity to accommodate diverse task requirements. In this paper, we propose Ortho-LoRA, a gradient projection method specifically tailored for the bipartite structure of LoRA. Ortho-LoRA dynamically projects conflicting task gradients onto the orthogonal complement of each other within the intrinsic LoRA subspace. Extensive experiments on the GLUE benchmark demonstrate that Ortho-LoRA effectively mitigates task interference, outperforming standard joint training and recovering 95\% of the performance gap between multi-task and single-task baselines with negligible computational overhead.
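The core projection step is PCGrad-style gradient surgery. A minimal sketch on plain vectors; Ortho-LoRA applies the same operation within the intrinsic low-rank subspace of the adapter's bipartite structure, which this sketch omits:

```python
def project_conflicting(g_task, g_other):
    """If two task gradients conflict (negative inner product), remove
    from g_task its component along g_other, leaving only the part lying
    in g_other's orthogonal complement. Non-conflicting gradients pass
    through unchanged."""
    dot = sum(a * b for a, b in zip(g_task, g_other))
    if dot >= 0:  # no conflict: leave the gradient as-is
        return list(g_task)
    norm_sq = sum(b * b for b in g_other)
    scale = dot / norm_sq
    return [a - scale * b for a, b in zip(g_task, g_other)]
```

After projection the surviving gradient is orthogonal to the conflicting task's direction, so its update no longer undoes that task's progress, which is the mechanism behind the reduced negative transfer reported above.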
https://arxiv.org/abs/2601.09684
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
https://arxiv.org/abs/2601.09668
Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce \textbf{Multi-Agent Test-Time Reinforcement Learning (MATTRL)}, a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool that is then reinjected into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67\% over a multi-agent baseline, and by 8.67\% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective, and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
https://arxiv.org/abs/2601.09667
Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.
https://arxiv.org/abs/2601.09665
Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video -- a common scenario in practice -- and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy ($>$97\%) while consuming less than 1 GB of GPU memory per batch -- an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper is available \href{this https URL}{here}.
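The in-batch pseudo-label assignment can be sketched directly. A real implementation would use the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`); the brute-force version below is adequate for the small, fixed number of individuals the method assumes, and the squared-distance cost is an illustrative choice rather than the paper's exact similarity:

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost one-to-one assignment between detections in two
    frames. Brute force over permutations; fine for a handful of
    individuals, where the Hungarian algorithm would be used at scale."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return list(best_perm), best_cost

def pseudo_labels(feats_a, feats_b):
    """Match each detection in frame A to one in frame B by squared
    feature distance; the matches serve as in-batch pseudo identity
    labels for self-bootstrapped training."""
    cost = [[sum((x - y) ** 2 for x, y in zip(fa, fb)) for fb in feats_b]
            for fa in feats_a]
    assignment, _ = best_assignment(cost)
    return assignment
```

Each sampled frame pair thus yields identity pseudo-labels for free, which is what lets the discriminative features be learned without any manual annotation.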
https://arxiv.org/abs/2601.09663
Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP's vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP's original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
https://arxiv.org/abs/2601.09661
Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed nature of this problem. Prior methods either require multi-view capture and expensive differentiable simulation or predict only garment geometry without the material properties required for realistic simulation. We propose a feed-forward framework that sidesteps these limitations by first fine-tuning a vision-language model to infer material composition and fabric attributes from real images, and then training a lightweight predictor that maps these attributes to the corresponding physical fabric parameters using a small dataset of material-physics measurements. Our approach introduces two new datasets (FTAG and T2P) and delivers simulation-ready garments from a single image without iterative optimization. Experiments show that our estimator achieves superior accuracy in material composition estimation and fabric attribute prediction, and by passing them through our physics parameter estimator, we further achieve higher-fidelity simulations compared to state-of-the-art image-to-garment methods.
https://arxiv.org/abs/2601.09658
Underwater video analysis is particularly challenging due to factors such as low lighting, color distortion, and turbidity, which compromise visual data quality and directly impact the performance of perception modules in robotic applications. This work proposes AquaFeat+, a plug-and-play pipeline designed to enhance features specifically for automated vision tasks, rather than for human perceptual quality. The architecture includes modules for color correction, hierarchical feature enhancement, and an adaptive residual output, which are trained end-to-end and guided directly by the loss function of the final application. Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics, validating its effectiveness for enhancing perception tasks in underwater robotic applications.
https://arxiv.org/abs/2601.09652
Word Sense Disambiguation (WSD) has been widely evaluated using the semantic frameworks of WordNet, BabelNet, and the Oxford Dictionary of English. However, for the UCREL Semantic Analysis System (USAS) framework, no open extensive evaluation has been performed beyond lexical coverage or single-language evaluation. In this work, we perform the largest semantic tagging evaluation of the rule-based system that uses the lexical resources in the USAS framework, covering five different languages using four existing datasets and one novel Chinese dataset. To overcome the lack of manually tagged training data, we create a new silver-labelled English dataset, on which we train and evaluate various mono- and multilingual neural models in both mono- and cross-lingual evaluation setups, with comparisons to their rule-based counterparts, and we show how a rule-based system can be enhanced with a neural network model. The resulting neural network models, the data they were trained on, the Chinese evaluation dataset, and all of the code have been released as open resources.
https://arxiv.org/abs/2601.09648
Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.
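The centroid-based attribution is a nearest-centroid classifier in embedding space. A minimal sketch with plain Python lists; in practice the embeddings would come from a pretrained image encoder, which is an assumption of this sketch rather than a detail stated above:

```python
def centroid(vectors):
    """Mean embedding of one model's generations."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def deanonymize(embedding, centroids_by_model):
    """Attribute an anonymous image to a T2I model: because each model's
    generations cluster in embedding space, the nearest centroid (by
    Euclidean distance) identifies the likely source model."""
    def dist_sq(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids_by_model,
               key=lambda m: dist_sq(embedding, centroids_by_model[m]))
```

Building the centroids requires only a handful of known generations per model, with no control over prompts and no access to training data, which is what makes leaderboard anonymity so fragile.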
https://arxiv.org/abs/2601.09647