As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data was sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
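The pairwise-tournament idea can be sketched with a simplified Elo-style update (the paper uses Glicko2, which additionally tracks a per-player rating deviation and volatility); the prompt names and K-factor below are illustrative, not from the paper:

```python
import math

def win_probability(r_a: float, r_b: float) -> float:
    """Expected score of rating r_a against r_b on the logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One judged comparison: score_a is 1.0 if A's question won, 0.0 otherwise."""
    e_a = win_probability(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Six prompt templates start at the same rating; judge verdicts move them apart.
ratings = {f"prompt_{i}": 1500.0 for i in range(6)}
ratings["prompt_0"], ratings["prompt_1"] = update(ratings["prompt_0"], ratings["prompt_1"], 1.0)
```

Accumulating many such judged pairs per dimension (format, dialogue support, appropriateness) yields the final standings from which pairwise win probabilities are read off.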
https://arxiv.org/abs/2601.16134
Motivated reasoning -- the idea that individuals processing information may be motivated to reach a certain conclusion, whether it be accurate or predetermined -- has been well-explored as a human phenomenon. However, it is unclear whether base LLMs mimic these motivational changes. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as smaller standard deviations and inaccurate argument strength assessments. We emphasize the importance of these findings for researchers using LLMs to automate tasks such as survey data collection and argument assessment.
https://arxiv.org/abs/2601.16130
Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data, or adding support for a new language, involves retraining the model, which is computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50\%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60\%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets, confirming that the approach works well for industrial use cases in addition to the academic settings already studied in previous work.
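The maintenance win comes from merging being cheap relative to training. A minimal sketch of uniform linear merging over per-language checkpoints, using toy numpy "state dicts" (shapes and names are illustrative, not the paper's models):

```python
import numpy as np

def merge_uniform(models: list) -> dict:
    """Uniform linear merge: average each parameter tensor across models."""
    return {k: np.mean([m[k] for m in models], axis=0) for k in models[0]}

# Toy per-language "models": one weight matrix each (illustrative shapes).
en = {"w": np.ones((2, 2)) * 1.0}
es = {"w": np.ones((2, 2)) * 3.0}
merged = merge_uniform([en, es])

# Updating one language means retraining only that single-language model
# and re-merging, instead of retraining the full multilingual model.
es_updated = {"w": np.ones((2, 2)) * 5.0}
remerged = merge_uniform([en, es_updated])
```

The re-merge is a pure parameter-space operation, which is where the reported >60\% maintenance savings come from.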
https://arxiv.org/abs/2601.16127
Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.
https://arxiv.org/abs/2601.16125
Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to limitations of the available resources. To meet such demands, layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network along with reducing overall computational complexity. However, existing $\mathcal{LD}$ methods greatly impact the dynamic model's performance for low and high dropping cases, deteriorating the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation utilizing well-known speech recognition methods, including conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by $9.32\%$ and $2.25\%$ for high and no dropping cases with $33.3\%$ reduction in training time.
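The core $\mathcal{LD}$ mechanism, stochastically skipping residual layers during training so a subset can be dropped at inference to meet a compute budget, can be sketched as follows (the distillation objective of DLD is not shown; layer functions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_with_layer_drop(x, layers, p_drop: float, training: bool):
    """Skip each residual layer independently with probability p_drop.

    At inference, a fixed subset can be skipped instead, trading accuracy
    for reduced computation on the edge device.
    """
    for f in layers:
        if training and rng.random() < p_drop:
            continue  # layer skipped: identity via the residual path
        x = x + f(x)
    return x

layers = [lambda x: 0.1 * x for _ in range(12)]  # stand-ins for conformer blocks
y_full = forward_with_layer_drop(np.ones(4), layers, p_drop=0.0, training=False)
```

Training with random drops makes the network robust to whichever subset is removed later; DLD additionally distills the full network's behavior into the dropped configurations.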
https://arxiv.org/abs/2601.16117
Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word-by-word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
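Two of the listed degradations, additive noise and blur, can be sketched on a rendered glyph image with plain numpy (a minimal stand-in for the tool's 25+ augmentations; the 8x8 "glyph" and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img: np.ndarray) -> np.ndarray:
    """Apply additive Gaussian noise, then a 3x3 box blur (no external deps)."""
    noisy = np.clip(img + rng.normal(0.0, 0.05, img.shape), 0.0, 1.0)
    padded = np.pad(noisy, 1, mode="edge")
    out = np.zeros_like(noisy)
    h, w = noisy.shape
    for dy in range(3):          # sum the 3x3 neighborhood by shifted slices
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

rendered = np.zeros((8, 8)); rendered[2:6, 2:6] = 1.0  # stand-in for a rendered glyph
degraded = augment(rendered)
```

Pairing each degraded image with the Unicode text it was rendered from is what turns a text corpus into labeled OCR training data without manual transcription.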
https://arxiv.org/abs/2601.16113
We propose a control framework that integrates model-based bipedal locomotion with residual reinforcement learning (RL) to achieve robust and adaptive walking in the presence of real-world uncertainties. Our approach leverages a model-based controller, comprising a Divergent Component of Motion (DCM) trajectory planner and a whole-body controller, as a reliable base policy. To address the uncertainties of inaccurate dynamics modeling and sensor noise, we introduce a residual policy trained through RL with domain randomization. Crucially, we employ a model-based oracle policy, which has privileged access to ground-truth dynamics during training, to supervise the residual policy via a novel supervised loss. This supervision enables the policy to efficiently learn corrective behaviors that compensate for unmodeled effects without extensive reward shaping. Our method demonstrates improved robustness and generalization across a range of randomized conditions, offering a scalable solution for sim-to-real transfer in bipedal locomotion.
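The residual composition itself is simple: the model-based controller produces a nominal action and the learned policy adds a bounded correction. A minimal sketch (the stand-in functions below are hypothetical, not the DCM planner or the trained network):

```python
import numpy as np

def act(obs, base_policy, residual_policy, scale: float = 0.1):
    """Final command = model-based base action + small learned correction."""
    a_base = base_policy(obs)
    a_res = residual_policy(obs)   # trained with RL plus oracle supervision
    return a_base + scale * a_res

base = lambda o: -1.0 * o          # stand-in for DCM planner + whole-body control
residual = lambda o: np.tanh(o)    # stand-in for the learned residual network
action = act(np.array([0.5, -0.5]), base, residual)
```

Keeping the residual small and bounded (here via `tanh` and `scale`) preserves the base controller's stability while letting the policy absorb unmodeled dynamics.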
https://arxiv.org/abs/2601.16109
Climate disinformation has become a major challenge in today's digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay action on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model's ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.
https://arxiv.org/abs/2601.16108
Although Mamba models greatly improve Hyperspectral Image (HSI) classification, they face critical challenges in defining efficient and adaptive token sequences for improved performance. This paper therefore presents the CSSMamba (Clustering-guided Spatial-Spectral Mamba) framework to address these challenges, with the following contributions. First, to achieve efficient and adaptive token sequences for improved Mamba performance, we integrate a clustering mechanism into a spatial Mamba architecture, leading to a cluster-guided spatial Mamba module (CSpaMamba) that reduces the Mamba sequence length and improves Mamba feature learning capability. Second, to improve the learning of both spatial and spectral information, we integrate the CSpaMamba module with a spectral Mamba module (SpeMamba), leading to a complete clustering-guided spatial-spectral Mamba framework. Third, to further improve feature learning capability, we introduce an Attention-Driven Token Selection mechanism to optimize Mamba token sequencing. Last, to seamlessly integrate clustering into the Mamba model in a coherent manner, we design a Learnable Clustering Module that learns cluster memberships in an adaptive manner. Experiments on the Pavia University, Indian Pines, and Liao-Ning 01 datasets demonstrate that CSSMamba achieves higher accuracy and better boundary preservation compared to state-of-the-art CNN, Transformer, and Mamba-based methods.
https://arxiv.org/abs/2601.16098
Large Language Models enable users to access databases through natural language interfaces such as Text2SQL, Text2SPARQL, and Text2Cypher, which translate user questions into structured database queries. While these systems improve database accessibility, most research focuses on English with limited multilingual support. This work investigates a scalable multilingual Text2Cypher, aiming to support new languages without re-running full fine-tuning, avoiding manual hyper-parameter tuning, and maintaining performance close to joint multilingual fine-tuning. We train language-specific LoRA adapters for English, Spanish, and Turkish and combine them via uniform linear merging or a learned fusion MLP with dynamic gating. Experimental results show that the fusion MLP recovers around 75\% of the accuracy gains from joint multilingual fine-tuning while requiring only a smaller subset of the data, outperforming linear merging across all three languages. This approach enables incremental expansion to new languages by requiring only one LoRA adapter and a lightweight MLP retraining. Learned adapter fusion offers a practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for the multilingual Text2Cypher task.
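The dynamic-gating idea can be sketched with numpy: a small learned head assigns a weight to each language adapter's output per input. The adapter functions and gate shapes below are toy illustrations, not the paper's architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fused_output(h, adapters, gate_w):
    """Dynamic gating: a learned head weights each language adapter's output.

    h        : hidden state from the frozen base model, shape (dim,)
    adapters : per-language LoRA deltas applied to h (illustrative callables)
    gate_w   : gating parameters, shape (n_adapters, dim)
    """
    outs = np.stack([a(h) for a in adapters])   # (n_adapters, dim)
    gates = softmax(gate_w @ h)                 # (n_adapters,)
    return (gates[:, None] * outs).sum(axis=0)

adapters = [lambda h: h + 1.0, lambda h: h - 1.0, lambda h: 2.0 * h]  # en / es / tr
h = np.array([0.5, -0.2])
y = fused_output(h, adapters, gate_w=np.zeros((3, 2)))  # zero gate -> uniform average
```

With zero gate weights this reduces to uniform linear merging; training `gate_w` (the lightweight MLP retraining in the abstract) is what recovers most of the joint fine-tuning gains.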
https://arxiv.org/abs/2601.16097
We introduce Neural Particle Automata (NPA), a Lagrangian generalization of Neural Cellular Automata (NCA) from static lattices to dynamic particle systems. Unlike classical Eulerian NCA where cells are pinned to pixels or voxels, NPA model each cell as a particle with a continuous position and internal state, both updated by a shared, learnable neural rule. This particle-based formulation yields clear individuation of cells, allows heterogeneous dynamics, and concentrates computation only on regions where activity is present. At the same time, particle systems pose challenges: neighborhoods are dynamic, and a naive implementation of local interactions scales quadratically with the number of particles. We address these challenges by replacing grid-based neighborhood perception with differentiable Smoothed Particle Hydrodynamics (SPH) operators backed by memory-efficient, CUDA-accelerated kernels, enabling scalable end-to-end training. Across tasks including morphogenesis, point-cloud classification, and particle-based texture synthesis, we show that NPA retain key NCA behaviors such as robustness and self-regeneration, while enabling new behaviors specific to particle systems. Together, these results position NPA as a compact neural model for learning self-organizing particle dynamics.
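SPH-style neighborhood perception replaces a grid lookup with a kernel-weighted average over nearby particles. A minimal O(n^2) reference version (the paper avoids the quadratic cost with CUDA-accelerated kernels; normalization constants are omitted for brevity):

```python
import numpy as np

def poly6(r, h):
    """Unnormalized poly6 SPH smoothing kernel: support radius h."""
    return np.where(r < h, (h ** 2 - r ** 2) ** 3, 0.0)

def sph_perceive(positions, states, h=1.0):
    """Each particle's perception = kernel-weighted average of neighbor states."""
    diff = positions[:, None, :] - positions[None, :, :]
    r = np.linalg.norm(diff, axis=-1)
    w = poly6(r, h)
    w = w / w.sum(axis=1, keepdims=True)   # self-term keeps rows nonzero
    return w @ states

pos = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])  # third particle is isolated
st = np.array([[1.0], [0.0], [7.0]])
perceived = sph_perceive(pos, st)
```

The perceived vector then feeds the shared neural update rule for each particle's state and position; because the kernel is smooth in the positions, the whole pipeline stays differentiable.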
https://arxiv.org/abs/2601.16096
Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
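The residual-vector-quantizer idea behind the compact mask tokens can be sketched in a two-stage toy form: quantize against a coarse codebook, then quantize the residual against a finer one, so two discrete codes reconstruct the input (codebooks and dimensions below are illustrative, not SAMTok's):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: quantize, then quantize the leftover residual, stage by stage."""
    codes, residual = [], x
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruction = sum of the selected code vectors."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

cb1 = np.array([[0.0, 0.0], [1.0, 1.0]])    # coarse codebook
cb2 = np.array([[0.0, 0.0], [0.1, -0.1]])   # refinement codebook
x = np.array([1.1, 0.9])
codes = rvq_encode(x, [cb1, cb2])
x_hat = rvq_decode(codes, [cb1, cb2])
```

Because each mask reduces to a couple of discrete codes, the MLLM can emit them as ordinary vocabulary tokens under standard next-token prediction.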
https://arxiv.org/abs/2601.16093
Clustering is a fundamental problem, aiming to partition a set of elements, like agents or data points, into clusters such that elements in the same cluster are closer to each other than to those in other clusters. In this paper, we present a new framework for studying online non-centroid clustering with delays, where elements, that arrive one at a time as points in a finite metric space, should be assigned to clusters, but assignments need not be immediate. Specifically, upon arrival, each point's location is revealed, and an online algorithm has to irrevocably assign it to an existing cluster or create a new one containing, at this moment, only this point. However, we allow decisions to be postponed at a delay cost, instead of following the more common assumption of immediate decisions upon arrival. This poses a critical challenge: the goal is to minimize both the total distance costs between points in each cluster and the overall delay costs incurred by postponing assignments. In the classic worst-case arrival model, where points arrive in an arbitrary order, no algorithm has a competitive ratio better than sublogarithmic in the number of points. To overcome this strong impossibility, we focus on a stochastic arrival model, where points' locations are drawn independently across time from an unknown and fixed probability distribution over the finite metric space. We offer hope for beyond worst-case adversaries: we devise an algorithm that is constant competitive in the sense that, as the number of points grows, the ratio between the expected overall costs of the output clustering and an optimal offline clustering is bounded by a constant.
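The rent-or-buy tension between distance cost and delay cost can be illustrated with a toy threshold rule on a line metric: assign a point immediately if a center is within theta, otherwise let delay accrue and open a new cluster once it reaches theta. This is illustrative only; the paper's algorithm and its constant-competitive analysis are for the stochastic arrival model:

```python
def online_cluster(stream, theta=1.0):
    """Toy delayed assignment: per step, each pending point either joins a
    center within distance theta, opens a new cluster once its accrued delay
    reaches theta, or keeps waiting (one unit of delay per step)."""
    centers, pending, assignment = [], [], {}
    for t, x in enumerate(stream):
        pending.append([t, x, 0.0])          # [arrival time, location, delay]
        still_waiting = []
        for item in pending:
            _, p, d = item
            near = min(centers, key=lambda c: abs(c - p), default=None)
            if near is not None and abs(near - p) <= theta:
                assignment[item[0]] = near   # cheap enough: assign now
            elif d >= theta:
                centers.append(p)            # waited long enough: open a cluster
                assignment[item[0]] = p
            else:
                item[2] = d + 1.0            # keep postponing, pay delay
                still_waiting.append(item)
        pending = still_waiting
    return centers, assignment

centers, assignment = online_cluster([0.0, 0.1, 10.0, 10.1, 10.2], theta=1.0)
```

Waiting lets nearby future arrivals share a cluster (lowering distance cost) at the price of delay cost, which is exactly the trade-off the competitive analysis bounds.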
https://arxiv.org/abs/2601.16091
Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work emphasizes turn-local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long-horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue. We introduce an agent-level affective subsystem that maintains a continuous Valence-Arousal-Dominance (VAD) state external to the language model and governed by first- and second-order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum-based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25-turn dialogue protocol, we compare stateless, first-order, and second-order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade-off between stability and responsiveness.
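The first- and second-order update rules can be written out directly: exponential smoothing of the instantaneous VAD estimate, and a momentum variant whose velocity term carries affective inertia across turns. Coefficients below are illustrative, not the paper's:

```python
def first_order(s, x, alpha=0.3):
    """Exponential smoothing of the instantaneous VAD estimate x."""
    return [(1 - alpha) * si + alpha * xi for si, xi in zip(s, x)]

def second_order(s, v, x, alpha=0.3, beta=0.8):
    """Momentum update: velocity v carries affective inertia between turns."""
    v = [beta * vi + alpha * (xi - si) for vi, si, xi in zip(v, s, x)]
    s = [si + vi for si, vi in zip(s, v)]
    return s, v

# Start neutral; five negative-valence turns, then five neutral recovery turns.
s1 = [0.0, 0.0, 0.0]                        # valence, arousal, dominance
s2, vel = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
for x in [[-1.0, 0.5, 0.0]] * 5 + [[0.0, 0.0, 0.0]] * 5:
    s1 = first_order(s1, x)
    s2, vel = second_order(s2, vel, x)
```

The first-order state decays monotonically back toward neutral, while the momentum version can overshoot and lag (hysteresis), which is the stability-responsiveness trade-off the abstract describes.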
https://arxiv.org/abs/2601.16087
Computing the conditional mode of a distribution, better known as the $\mathit{maximum\ a\ posteriori}$ (MAP) assignment, is a fundamental task in probabilistic inference. However, MAP estimation is generally intractable, and remains hard even under many common structural constraints and approximation schemes. We introduce $\mathit{probably\ approximately\ correct}$ (PAC) algorithms for MAP inference that provide provably optimal solutions under variable and fixed computational budgets. We characterize tractability conditions for PAC-MAP using information theoretic measures that can be estimated from finite samples. Our PAC-MAP solvers are efficiently implemented using probabilistic circuits with appropriate architectures. The randomization strategies we develop can be used either as standalone MAP inference techniques or to improve on popular heuristics, fortifying their solutions with rigorous guarantees. Experiments confirm the benefits of our method in a range of benchmarks.
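The PAC flavor of such guarantees can be illustrated with a generic sampling bound: drawing m >= ln(1/delta)/eps i.i.d. proposals and keeping the best one lands in the top eps-quantile of the proposal distribution with probability at least 1 - delta, since (1-eps)^m <= exp(-eps*m) <= delta. A minimal sketch where naive sampling stands in for the paper's probabilistic-circuit solvers:

```python
import math, random

def pac_sample_size(eps: float, delta: float) -> int:
    """Smallest m guaranteeing (1 - eps)^m <= delta, via the exp(-eps*m) bound."""
    return math.ceil(math.log(1.0 / delta) / eps)

def pac_map(sample_fn, score_fn, eps=0.05, delta=0.01, seed=0):
    """Draw m proposals and keep the highest-scoring one.

    With probability >= 1 - delta, the result lies in the top eps-quantile
    of the proposal distribution (a generic PAC argument, not the paper's
    tractability characterization).
    """
    random.seed(seed)
    m = pac_sample_size(eps, delta)
    best = max((sample_fn() for _ in range(m)), key=score_fn)
    return best, m

best, m = pac_map(lambda: random.random(), lambda x: x)
```

The same "fortify a heuristic" pattern applies by seeding the candidate set with a heuristic's solution and only accepting samples that beat it.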
https://arxiv.org/abs/2601.16083
Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world scenarios. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. By masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on-par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
https://arxiv.org/abs/2601.16079
One of the core advantages of the SE2(3) Lie group framework for navigation modeling lies in the autonomy of error propagation. In the previous paper, a theoretical analysis of the autonomy property of the navigation model in the inertial, earth, and world frames was given, and a construction method for the SE2(3) group navigation model was proposed to move the non-inertial navigation model toward full autonomy. This paper serves as a counterpart to the previous paper and conducts real-world strapdown inertial navigation system (SINS)/odometer (ODO) experiments as well as Monte-Carlo simulations to demonstrate the performance of the improved SE2(3)-group-based high-precision navigation models.
https://arxiv.org/abs/2601.16078
Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
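The mutual-distillation step can be sketched as temperature-softened KL losses in both directions on a shared (here: synthetic) sample; the logits, temperature, and absence of any loss weighting are illustrative choices, not the paper's exact objective:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def mutual_distill_losses(fm_logits, client_logits, T=2.0):
    """Distillation in both directions: the lightweight client mimics the
    foundation model's softened predictions, and vice versa."""
    p_fm = softmax(fm_logits, T)
    p_cl = softmax(client_logits, T)
    return kl(p_fm, p_cl), kl(p_cl, p_fm)

loss_to_client, loss_to_fm = mutual_distill_losses([2.0, 0.5], [1.0, 1.0])
```

Running this on generated images rather than a real public dataset is what removes the privacy and availability constraints on the shared distillation set.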
https://arxiv.org/abs/2601.16073
Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in task-irrelevant regions, which we describe as 'distracting tokens'. This behavior can disturb the model's generation of the desired action tokens at each step, affecting the success rate of tasks. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate, as well as explore the performance upper bounds of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: this https URL.
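One plausible scoring rule for such pruning, dropping the image tokens that draw the most attention while lying outside the task-relevant region, can be sketched as follows (the scoring rule, mask source, and keep ratio are illustrative assumptions, not DTP's actual mechanism):

```python
import numpy as np

def prune_distracting_tokens(attn, task_relevant_mask, keep_ratio=0.5):
    """Keep the tokens with the lowest 'distraction' score.

    attn               : (n_tokens,) attention mass each image token receives
    task_relevant_mask : (n_tokens,) 1 if the token overlaps the task region
    """
    distract_score = attn * (1 - task_relevant_mask)   # off-task attention only
    n_keep = int(len(attn) * keep_ratio)
    keep = np.argsort(distract_score)[:n_keep]         # least distracting first
    return np.sort(keep)

attn = np.array([0.05, 0.40, 0.10, 0.30, 0.10, 0.05])
mask = np.array([1, 0, 1, 0, 1, 1])                    # tokens 1 and 3 are off-task
kept = prune_distracting_tokens(attn, mask, keep_ratio=2 / 3)
```

Because pruning only removes inputs to the frozen model, it is plug-and-play: no retraining, no architectural change, and no extra inputs beyond what the VLA already sees.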
https://arxiv.org/abs/2601.16065
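The pruning idea above can be sketched as a ranking problem: given how much attention each image token receives from the action-token queries, drop the irrelevant tokens that draw the most attention first, since those are the "distracting" ones. This is a hedged numpy sketch; the abstract does not specify DTP's actual detection rule, so the relevance mask, keep ratio, and function name are illustrative assumptions.

```python
import numpy as np

def prune_distracting_tokens(attn, relevant_mask, keep_ratio=0.75):
    """attn: (num_image_tokens,) mean attention each image token receives
    from the action-token queries. relevant_mask: boolean array marking
    tokens inside the task-relevant region (assumed given externally).
    Returns the sorted indices of tokens to keep."""
    n = attn.shape[0]
    n_keep = max(1, int(n * keep_ratio))
    # Relevant tokens are always preferred (+inf score); among
    # irrelevant tokens, those drawing the MOST attention are the
    # distracting ones and are pruned first.
    score = np.where(relevant_mask, np.inf, -attn)
    order = np.argsort(-score, kind="stable")  # descending by score
    return np.sort(order[:n_keep])
```

Being a pure post-hoc filter over attention scores, a mechanism like this stays plug-and-play: it changes which image tokens the model sees, not the model's architecture or inputs.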
Deep learning has substantially advanced medical image segmentation, yet achieving robust generalization across diverse imaging modalities and anatomical structures remains a major challenge. A key contributor to this limitation lies in how existing architectures, ranging from CNNs to Transformers and their hybrids, primarily encode spatial information while overlooking frequency-domain representations that capture rich structural and textural cues. Although a few recent studies have begun exploring spectral information at the feature level, supervision-level integration of frequency cues, which is crucial for fine-grained object localization, remains largely untapped. To this end, we propose Phi-SegNet, a CNN-based architecture that incorporates phase-aware information at both the architectural and optimization levels. The network integrates Bi-Feature Mask Former (BFMF) modules that blend neighboring encoder features to reduce semantic gaps, and Reverse Fourier Attention (RFA) blocks that refine decoder outputs using phase-regularized features. A dedicated phase-aware loss aligns these features with structural priors, forming a closed feedback loop that emphasizes boundary precision. Evaluated on five public datasets spanning X-ray, ultrasound, histopathology, MRI, and colonoscopy, Phi-SegNet consistently achieved state-of-the-art performance, with an average relative improvement of 1.54+/-1.26% in IoU and 0.98+/-0.71% in F1-score over the next best-performing model. In cross-dataset generalization scenarios involving unseen datasets from known domains, Phi-SegNet also exhibits robust and superior performance, highlighting its adaptability and modality-agnostic design. These findings demonstrate the potential of leveraging spectral priors in both feature representation and supervision, paving the way for generalized segmentation frameworks that excel in fine-grained object localization.
https://arxiv.org/abs/2601.16064
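The supervision-level use of frequency cues above rests on a classical signal-processing fact: the Fourier phase spectrum carries most of an image's structural (boundary) information. A minimal numpy sketch of a phase-alignment penalty follows; Phi-SegNet's actual phase-aware loss and its weighting are not given in the abstract, so this wrap-aware cosine formulation is an assumption of mine.

```python
import numpy as np

def phase_loss(pred, target):
    """Penalize misalignment between the 2-D Fourier phase of a
    predicted mask and that of the ground truth. Because phase encodes
    structure, a phase-level penalty emphasizes boundary precision."""
    phi_pred = np.angle(np.fft.fft2(pred))
    phi_tgt = np.angle(np.fft.fft2(target))
    # 1 - cos(dphi) is zero when phases align and handles the
    # 2*pi wrap-around of angles correctly.
    return np.mean(1.0 - np.cos(phi_pred - phi_tgt))
```

In training, a term like this would typically be added to a standard region loss (e.g. Dice or cross-entropy), so the spectral prior supervises structure while the region loss supervises area.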