While GUI agents have shown strong performance under explicit and complete instructions, real-world deployment requires aligning with users' more complex implicit intents. In this work, we introduce Hierarchical Implicit Intent Alignment for Personalized GUI Agents (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and to anticipate latent routines from user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents' ability to resolve vague instructions and provide proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce the Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updated personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS; results show that HIM-Agent significantly improves execution and proactive performance by 15.7% and 7.3%, respectively.
https://arxiv.org/abs/2601.09636
Taxonomies form the backbone of structured knowledge representation across diverse domains, enabling applications such as e-commerce catalogs, semantic search, and biomedical discovery. Yet, manual taxonomy expansion is labor-intensive and cannot keep pace with the emergence of new concepts. Existing automated methods rely on point-based vector embeddings, which model symmetric similarity and thus struggle with the asymmetric "is-a" relationships that are fundamental to taxonomies. Box embeddings offer a promising alternative by enabling containment and disjointness, but they face key issues: (i) unstable gradients at the intersection boundaries, (ii) no notion of semantic uncertainty, and (iii) limited capacity to represent polysemy or ambiguity. We address these shortcomings with TaxoBell, a Gaussian box embedding framework that translates between box geometries and multivariate Gaussian distributions, where means encode semantic location and covariances encode uncertainty. Energy-based optimization yields stable training, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning. Extensive experimentation on five benchmark datasets demonstrates that TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines by 19% in MRR and around 25% in Recall@k. We further demonstrate the advantages and pitfalls of TaxoBell with error analysis and ablation studies.
https://arxiv.org/abs/2601.09633
Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant "Reasoning Gap": while native-like models (Claude 3.7) perform intuitively (40\% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54\%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4\% valid poems), while our hybrid verification loop restores performance to 73.1\%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
https://arxiv.org/abs/2601.09631
Machine unlearning is becoming essential for building trustworthy and compliant language models. Yet unlearning success varies considerably across individual samples: some are reliably erased, while others persist despite the same procedure. We argue that this disparity is not only a data-side phenomenon, but also reflects model-internal mechanisms that encode and protect memorized information. We study this problem from a mechanistic perspective based on model circuits--structured interaction pathways that govern how predictions are formed. We propose Circuit-guided Unlearning Difficulty (CUD), a {\em pre-unlearning} metric that assigns each sample a continuous difficulty score using circuit-level signals. Extensive experiments demonstrate that CUD reliably separates intrinsically easy and hard samples, and remains stable across unlearning methods. We identify key circuit-level patterns that reveal a mechanistic signature of difficulty: easy-to-unlearn samples are associated with shorter, shallower interactions concentrated in the early-to-intermediate parts of the original model, whereas hard samples rely on longer and deeper pathways closer to late-stage computation. Compared to existing qualitative studies, CUD takes a first step toward a principled, fine-grained, and interpretable analysis of unlearning difficulty, and motivates the development of unlearning methods grounded in model mechanisms.
https://arxiv.org/abs/2601.09624
Accurate and early perception of potential intrusion targets is essential for ensuring the safety of railway transportation systems. However, most existing systems focus narrowly on object classification within fixed visual scopes and apply rule-based heuristics to determine intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating such risks requires the cognition of spatial context and temporal dynamics for the object of interest (OOI), which presents challenges for conventional visual models. To facilitate deep intrusion perception, we introduce a novel benchmark, CogRail, which integrates curated open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art visual-language models (VLMs) using multimodal prompts to identify their strengths and limitations in this domain. Furthermore, we fine-tune VLMs for better performance and propose a joint fine-tuning framework that integrates three core tasks (position perception, movement prediction, and threat analysis), facilitating effective adaptation of general-purpose foundation models into specialized models tailored for cognitive intrusion perception. Extensive experiments reveal that current large-scale multimodal models struggle with the complex spatial-temporal reasoning required by the cognitive intrusion perception task, underscoring the limitations of existing foundation models in this safety-critical domain. In contrast, our proposed joint fine-tuning framework significantly enhances model performance by enabling targeted adaptation to domain-specific reasoning demands, highlighting the advantages of structured multi-task learning in improving both accuracy and interpretability. Code will be available at this https URL.
https://arxiv.org/abs/2601.09613
Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.
https://arxiv.org/abs/2601.09609
Most Multimodal Sentiment Analysis research has focused on point-wise regression. While straightforward, this approach is sensitive to label noise and neglects whether one sample is more positive than another, resulting in unstable predictions and poor correlation alignment. Pairwise ordinal learning frameworks emerged to address this gap, capturing relative order by learning from comparisons. Yet, they introduce two new trade-offs: First, they assign uniform importance to all comparisons, failing to adaptively focus on hard-to-rank samples. Second, they employ static ranking margins, which fail to reflect the varying semantic distances between sentiment groups. To address this, we propose a Two-Stage Group-wise Ranking and Calibration Framework (GRCF) that adapts the philosophy of Group Relative Policy Optimization (GRPO). Our framework resolves these trade-offs by simultaneously preserving relative ordinal structure, ensuring absolute score calibration, and adaptively focusing on difficult samples. Specifically, Stage 1 introduces a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure. Stage 2 then employs an MAE-driven objective to align prediction magnitudes. To validate its generalizability, we extend GRCF to classification tasks, including multimodal humor detection and sarcasm detection. GRCF achieves state-of-the-art performance on core regression benchmarks, while also showing strong generalizability in classification tasks.
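As a rough illustration of the Stage-1 idea, the sketch below combines a dynamic margin proportional to the label gap with an advantage-like weight that up-weights pairs whose hinge violation is above the group average. The function name, the weighting scheme, and the hyperparameters are hypothetical; the paper's exact GRPO-inspired formulation may differ.

```python
def grcf_stage1_loss(scores, labels, base_margin=0.1, beta=1.0):
    """Hypothetical sketch of an advantage-weighted dynamic-margin
    pairwise ranking loss.  For each pair with labels[i] > labels[j]:
      margin_ij = base_margin * (labels[i] - labels[j])     (dynamic margin)
      w_ij      = 1 + beta * violation_ij / mean_violation  (advantage-like
                  weight: hard-to-rank pairs receive more weight)
    Not the paper's exact formulation."""
    pairs, violations = [], []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                m = base_margin * (labels[i] - labels[j])
                v = max(0.0, m - (scores[i] - scores[j]))  # hinge violation
                pairs.append((i, j))
                violations.append(v)
    if not pairs:
        return 0.0
    mean_v = sum(violations) / len(violations)
    if mean_v == 0.0:
        return 0.0  # every pair already ranked with the required margin
    weights = [1.0 + beta * v / mean_v for v in violations]
    return sum(w * v for w, v in zip(weights, violations)) / len(pairs)
```

When the ordering is already satisfied with margin, the loss vanishes, so Stage 2's MAE-driven calibration can refine magnitudes without fighting the ranking objective.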
https://arxiv.org/abs/2601.09606
Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO -- an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this domain, MANGO outperforms all other image translation methods we tested. Imitation-learning policies trained on data augmented by MANGO are able to achieve success rates as high as 60\% on views that the non-augmented policy fails completely on.
https://arxiv.org/abs/2601.09605
In recent years, foundation models have become very popular due to their exceptional performance, mainly in the natural language processing (NLP) tasks where they were first introduced. These models usually consist of hundreds of millions, or even billions, of parameters, making them resource-intensive during training and in production systems, leading to increased costs. This paper focuses on reducing a foundation model's size when applied to music information retrieval (MIR) tasks. Our research combines the Branchformer architecture with SummaryMixing, both first applied in speech recognition, along with a random quantization process. To facilitate reproducibility, we conduct pre-training on publicly available datasets, complemented by a proprietary dataset comparable in scale to other private datasets reported in the literature. We ensure robust evaluation by using a framework consisting of a variety of downstream MIR tasks. Our results show that our architecture achieves competitive performance compared with other state-of-the-art models that use multi-head self-attention, while reducing the model size by 8.5% to 12.3%.
https://arxiv.org/abs/2601.09603
Point cloud registration is a central theme in computer vision, with alignment algorithms continuously improving for greater robustness. Commonly used methods evaluate Euclidean distances between point clouds and minimize an objective function, such as Root Mean Square Error (RMSE). However, these approaches are most effective when the point clouds are well-prealigned and issues such as differences in density, noise, holes, and limited overlap can compromise the results. Traditional methods, such as Iterative Closest Point (ICP), require choosing one point cloud as fixed, since Euclidean distances lack commutativity. When only one point cloud has issues, adjustments can be made, but in real scenarios, both point clouds may be affected, often necessitating preprocessing. The authors introduce a novel differential entropy-based metric, designed to serve as the objective function within an optimization framework for fine rigid pairwise 3D point cloud registration, denoted as Iterative Differential Entropy Minimization (IDEM). This metric does not depend on the choice of a fixed point cloud and, during transformations, reveals a clear minimum corresponding to the best alignment. Multiple case studies are conducted, and the results are compared with those obtained using RMSE, Chamfer distance, and Hausdorff distance. The proposed metric proves effective even with density differences, noise, holes, and partial overlap, where RMSE does not always yield optimal alignment.
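To make the entropy idea concrete, the sketch below scores an alignment by the differential entropy of a Gaussian fitted to the union of the two clouds, $H = \frac{d}{2}\ln(2\pi e) + \frac{1}{2}\ln\det\Sigma$, which is symmetric in its two inputs, unlike nearest-neighbor RMSE with a designated fixed cloud. This is an illustrative proxy under that assumption, not the paper's exact IDEM objective.

```python
import math

def gaussian_entropy(points):
    """Differential entropy of a Gaussian fitted to a 3-D point set:
    H = d/2 * ln(2*pi*e) + 1/2 * ln det(Sigma).
    Illustrative proxy for entropy-based alignment scoring; not the
    paper's exact IDEM objective."""
    n, d = len(points), 3
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    # sample covariance matrix (pure Python, 3x3)
    S = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
          for j in range(d)] for i in range(d)]
    det = (S[0][0] * (S[1][1] * S[2][2] - S[1][2] * S[2][1])
           - S[0][1] * (S[1][0] * S[2][2] - S[1][2] * S[2][0])
           + S[0][2] * (S[1][0] * S[2][1] - S[1][1] * S[2][0]))
    return 0.5 * d * math.log(2 * math.pi * math.e) + 0.5 * math.log(det)

def alignment_score(cloud_a, cloud_b):
    """Entropy of the union: lower when the clouds overlap tightly.
    Symmetric in its arguments -- neither cloud must be chosen as fixed."""
    return gaussian_entropy(cloud_a + cloud_b)
```

Minimizing such a score over rigid transforms rewards tight overlap of the union, and the symmetry removes the fixed/moving asymmetry that ICP-style Euclidean objectives carry.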
https://arxiv.org/abs/2601.09601
Handwriting remains an essential skill, particularly in education. Therefore, providing visual feedback on handwritten documents is an important but understudied area. We outline the many challenges when going from an image of handwritten input to correctly placed informative error feedback. We empirically compare modular and end-to-end systems and find that both approaches currently do not achieve acceptable overall quality. We identify the major challenges and outline an agenda for future research.
https://arxiv.org/abs/2601.09586
In complex environments, autonomous robot navigation and environmental perception pose higher requirements for SLAM technology. This paper presents a novel method for semantically enhancing 3D point cloud maps with thermal information. By first performing pixel-level fusion of visible and infrared images, the system projects real-time LiDAR point clouds onto this fused image stream. It then segments heat source features in the thermal channel to instantly identify high temperature targets and applies this temperature information as a semantic layer on the final 3D map. This approach generates maps that not only have accurate geometry but also possess a critical semantic understanding of the environment, making it highly valuable for specific applications like rapid disaster assessment and industrial preventive maintenance.
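A minimal sketch of the projection-and-tagging step described above, assuming a pinhole camera model with intrinsics (fx, fy, cx, cy) and LiDAR points already expressed in the camera frame; the interfaces and the hot-spot threshold are illustrative assumptions, not the paper's implementation.

```python
def project_and_tag(points, K, thermal, hot_thresh=60.0):
    """Project 3-D LiDAR points (camera frame, Z forward) onto the fused
    image with pinhole intrinsics K = (fx, fy, cx, cy), then tag each
    visible point with the temperature read from the thermal channel and
    a hot-spot flag.  Illustrative sketch only."""
    fx, fy, cx, cy = K
    h, w = len(thermal), len(thermal[0])
    tagged = []
    for x, y, z in points:
        if z <= 0:
            continue                                  # behind the camera
        u, v = int(fx * x / z + cx), int(fy * y / z + cy)
        if 0 <= u < w and 0 <= v < h:                 # inside the image
            temp = thermal[v][u]
            tagged.append((x, y, z, temp, temp >= hot_thresh))
    return tagged
```

The flagged temperatures can then be attached to the corresponding map voxels as the semantic layer the abstract describes.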
https://arxiv.org/abs/2601.09578
We study permutation (jumbled/Abelian) pattern matching over a general alphabet $\Sigma$. Given a pattern $P$ of length $m$ and a text $T$ of length $n$, the classical task is to decide whether $T$ contains a length-$m$ substring whose Parikh vector equals that of $P$. While this existence problem admits a linear-time sliding-window solution, many practical applications require optimization and packing variants beyond mere detection. We present a unified sliding-window framework based on maintaining the Parikh-vector difference between $P$ and the current window of $T$, enabling permutation matching in $O(n + \sigma)$ time and $O(\sigma)$ space, where $\sigma = |\Sigma|$. Building on this foundation, we introduce a combinatorial-optimization variant that we call Maximum Feasible Substring under Pattern Supply (MFSP): find the longest substring $S$ of $T$ whose symbol counts are component-wise bounded by those of $P$. We show that MFSP can also be solved in $O(n + \sigma)$ time via a two-pointer feasibility maintenance algorithm, providing an exact packing interpretation of $P$ as a resource budget. Finally, we address non-overlapping occurrence selection by modeling each permutation match as an equal-length interval and proving that a greedy earliest-finishing strategy yields a maximum-cardinality set of disjoint matches, computable in linear time once all matches are enumerated. Our results provide concise, provably correct algorithms with tight bounds, and connect frequency-based string matching to packing-style optimization primitives.
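The sliding-window framework and the MFSP two-pointer algorithm can be sketched as follows (an illustration of the stated techniques, not the authors' code):

```python
from collections import Counter

def permutation_matches(text, pattern):
    """Return start indices i such that text[i:i+m] is an anagram of
    pattern.  Maintains the Parikh-vector difference between the pattern
    and the current window; O(n + sigma) time, O(sigma) space."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    diff = Counter(pattern)
    diff.subtract(Counter(text[:m]))
    mismatched = sum(1 for v in diff.values() if v != 0)
    out = [0] if mismatched == 0 else []
    for i in range(m, n):
        # window gains text[i] (difference -1), loses text[i - m] (+1)
        for c, step in ((text[i], -1), (text[i - m], +1)):
            before = diff[c]
            diff[c] = before + step
            mismatched += (diff[c] != 0) - (before != 0)
        if mismatched == 0:
            out.append(i - m + 1)
    return out

def mfsp(text, pattern):
    """Maximum Feasible Substring under Pattern Supply: longest substring
    of text whose symbol counts are component-wise bounded by pattern's
    counts, via two-pointer feasibility maintenance."""
    supply = Counter(pattern)
    window = Counter()
    best, left = (0, 0), 0              # (length, start)
    for right, c in enumerate(text):
        window[c] += 1
        while window[c] > supply[c]:    # shrink until feasible again
            window[text[left]] -= 1
            left += 1
        if right - left + 1 > best[0]:
            best = (right - left + 1, left)
    return text[best[1]:best[1] + best[0]]
```

The non-overlapping selection step then reduces to greedy interval scheduling over the equal-length matches returned by the first routine.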
https://arxiv.org/abs/2601.09577
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel also builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, ours is training-free and does not introduce embeddings from a CLIP/BERT text encoder; instead, we directly perform text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly on complex referring expression segmentation (RES) tasks. The code will be open-sourced.
https://arxiv.org/abs/2601.09575
Longitudinal brain MRI is essential for lifespan studies, yet high attrition rates often lead to missing data, complicating analysis. Deep generative models have been explored, but most rely solely on image intensity, leading to two key limitations: 1) the fidelity and trustworthiness of the generated brain images are limited, making downstream studies questionable; 2) usage flexibility is restricted because guidance is fixed in the model structure, limiting adaptation to versatile application scenarios. To address these challenges, we introduce DF-DiffCom, a Kolmogorov-Arnold Network (KAN)-enhanced diffusion model that leverages deformation fields for trustworthy longitudinal brain image completion. Trained on OASIS-3, DF-DiffCom outperforms state-of-the-art methods, improving PSNR by 5.6% and SSIM by 0.12. More importantly, its modality-agnostic nature allows smooth extension to varied MRI modalities, even to attribute maps such as brain tissue segmentation results.
https://arxiv.org/abs/2601.09572
Autonomous systems conducting schema-grounded information-gathering dialogues face an instrumentation gap, lacking turn-level observables for monitoring acquisition efficiency and detecting when questioning becomes unproductive. We introduce Dialogue Telemetry (DT), a measurement framework that produces two model-agnostic signals after each question-answer exchange: (i) a Progress Estimator (PE) quantifying residual information potential per category (with a bits-based variant), and (ii) a Stalling Index (SI) detecting an observable failure signature characterized by repeated category probing with semantically similar, low-marginal-gain responses. SI flags this pattern without requiring causal diagnosis, supporting monitoring in settings where attributing degradation to specific causes may be impractical. We validate DT in controlled search-and-rescue (SAR)-inspired interviews using large language model (LLM)-based simulations, distinguishing efficient from stalled dialogue traces and illustrating downstream utility by integrating DT signals into a reinforcement learning (RL) policy. Across these settings, DT provides interpretable turn-level instrumentation that improves policy performance when stalling carries operational costs.
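As a toy illustration of the Stalling Index, the sketch below flags a window of turns that repeatedly probe one category with lexically similar, low-gain answers. The Jaccard similarity, the thresholds, and the tuple interface are illustrative choices, not the paper's definition of SI.

```python
def stalling_index(turns, sim_thresh=0.5, gain_thresh=0.2, window=3):
    """Toy Stalling Index: nonzero when the last `window` turns probe the
    same category with lexically similar, low-marginal-gain answers.
    `turns` is a list of (category, answer_text, info_gain) tuples.
    Thresholds and token-Jaccard similarity are illustrative."""
    if len(turns) < window:
        return 0.0
    recent = turns[-window:]
    if len({c for c, _, _ in recent}) > 1:
        return 0.0                       # probing varied categories: not stalled
    def jaccard(a, b):
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
    sims = [jaccard(recent[k][1], recent[k + 1][1]) for k in range(window - 1)]
    mean_sim = sum(sims) / len(sims)
    mean_gain = sum(g for _, _, g in recent) / window
    if mean_sim >= sim_thresh and mean_gain <= gain_thresh:
        return mean_sim * (1.0 - mean_gain)   # higher = more stalled
    return 0.0
```

Note that the flag is purely observational, matching the abstract's point that SI detects the failure signature without attributing a cause.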
https://arxiv.org/abs/2601.09570
Large language models typically represent Chinese characters as discrete index-based tokens, largely ignoring their visual form. For logographic scripts, visual structure carries semantic and phonetic information, which may aid prediction. We investigate whether low-resolution visual inputs can serve as an alternative for character-level modeling. Instead of token IDs, our decoder receives grayscale images of individual characters, with resolutions as low as $8 \times 8$ pixels. Remarkably, these inputs achieve 39.2\% accuracy, comparable to the index-based baseline of 39.1\%. Such low-resource settings also exhibit a pronounced \emph{hot-start} effect: by 0.4\% of total training, accuracy reaches above 12\%, while index-based models lag at below 6\%. Overall, our results demonstrate that minimal visual structure can provide a robust and efficient signal for Chinese language modeling, offering an alternative perspective on character representation that complements traditional index-based approaches.
https://arxiv.org/abs/2601.09566
Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under MXFP formats remain largely unexplored. To address this gap, this work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. The key findings include: 1) MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities, in particular, quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) The scaling factor of quantization is a critical error source in MXFP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. Together, these results provide practical guidance on adapting existing PTQ methods to MXFP quantization.
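For context, MX formats store small blocks of elements (32 in the OCP Microscaling specification) with a shared power-of-two scale; MXFP4 uses FP4 E2M1 elements. The sketch below mimics this scheme and shows why the scale is a critical error source: every value in a block is committed to the same exponent. It is an illustrative model, not a bit-exact codec.

```python
import math

# Representable magnitudes of the FP4 E2M1 element format used by MXFP4.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def mxfp4_quantize_block(block):
    """Quantize one block with a shared power-of-two scale (microscaling).
    Returns (scale_exponent, signed_codes).  Illustrative sketch; real
    MXFP4 packs 32 elements per block with an E8M0 scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # choose the scale so amax maps near the top of the FP4 grid (6.0);
    # a pre-scale optimization would tune this choice per block
    exp = math.floor(math.log2(amax / 6.0))
    scale = 2.0 ** exp
    codes = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # round to nearest
        codes.append(math.copysign(q, x))
    return exp, codes

def mxfp4_dequantize_block(exp, codes):
    scale = 2.0 ** exp
    return [c * scale for c in codes]
```

Because the whole block shares one exponent, a single outlier forces a coarse grid onto every other element, which is exactly the scale-induced error the abstract's pre-scale strategy targets.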
https://arxiv.org/abs/2601.09555
We explore a situation in which the target domain is accessible, but real-time data annotation is not feasible. Instead, we would like to construct an alternative training set from a large-scale data server so that a competitive model can be obtained. Because the target domain usually exhibits distinct modes (i.e., semantic clusters representing the data distribution), model performance is compromised if the training set does not contain these target modes. While prior works iteratively improve algorithms, our research explores the often-overlooked potential of optimizing the structure of the data server. Inspired by the hierarchical nature of web search engines, we introduce a hierarchical data server together with a bipartite mode matching (BMM) algorithm to align source and target modes. For each target mode, we look in the server data tree for the best mode match, which may be large or small in size. Through bipartite matching, we aim for all target modes to be optimally matched with source modes in a one-to-one fashion. Compared with existing training set search algorithms, we show that the matched server modes constitute training sets with consistently smaller domain gaps to the target domain across object re-identification (re-ID) and detection tasks. Consequently, models trained on our searched training sets achieve higher accuracy than those trained otherwise. BMM enables data-centric unsupervised domain adaptation (UDA) orthogonal to existing model-centric UDA methods; combining BMM with existing UDA methods such as pseudo-labeling yields further improvement.
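The one-to-one matching at the heart of BMM can be illustrated with a brute-force assignment over a small gap matrix; the interface below is hypothetical, and a real system would use a polynomial-time solver such as the Hungarian algorithm (e.g. SciPy's `linear_sum_assignment`).

```python
from itertools import permutations

def bipartite_mode_match(gap):
    """Exhaustive one-to-one matching of target modes to source modes,
    minimizing the total domain gap.  gap[t][s] is the distance between
    target mode t and source mode s (requires at least as many source
    modes as target modes).  Brute force for clarity only."""
    n_t, n_s = len(gap), len(gap[0])
    best_cost, best = float("inf"), None
    for perm in permutations(range(n_s), n_t):  # assign each target a source
        cost = sum(gap[t][s] for t, s in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, list(enumerate(perm))
    return best, best_cost
```

Each matched source mode's data then joins the searched training set, so minimizing total gap directly minimizes the training-set-to-target domain gap the abstract measures.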
https://arxiv.org/abs/2601.09531
Egocentric Human-Object Interaction (EHOI) analysis is crucial for industrial safety, yet the development of robust models is hindered by the scarcity of annotated domain-specific data. We address this challenge by introducing a data generation framework that combines synthetic data with a diffusion-based process to augment real-world images with realistic Personal Protective Equipment (PPE). We present GlovEgo-HOI, a new benchmark dataset for industrial EHOI, and GlovEgo-Net, a model integrating Glove-Head and Keypoint- Head modules to leverage hand pose information for enhanced interaction detection. Extensive experiments demonstrate the effectiveness of the proposed data generation framework and GlovEgo-Net. To foster further research, we release the GlovEgo-HOI dataset, augmentation pipeline, and pre-trained models at: GitHub project.
https://arxiv.org/abs/2601.09528