Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mainly redact all PII, disregarding the fact that some of it may be contextually relevant to the user's question, which degrades response quality. Large language models (LLMs) might be able to help determine which PII are relevant, but due to their closed-source nature and lack of privacy guarantees, they are unsuitable for sensitive data processing. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) to filter sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance, and type accuracy while preserving significantly higher downstream utility under anonymization.
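The relevance-aware anonymization step can be illustrated with a small sketch. This is a hypothetical example of how a downstream filter might consume detector output; the span format and the `[TYPE]` placeholder convention are assumptions, not CAPID's actual interface.

```python
def anonymize(query, detections):
    """Redact detected PII spans, keeping those marked contextually relevant.

    detections: list of (start, end, pii_type, relevant) tuples over `query`,
    as a relevance-aware detector such as a fine-tuned SLM might emit.
    """
    out, pos = [], 0
    for start, end, pii_type, relevant in sorted(detections):
        out.append(query[pos:start])
        # relevant PII survives anonymization; irrelevant PII is masked by type
        out.append(query[start:end] if relevant else f"[{pii_type}]")
        pos = end
    out.append(query[pos:])
    return "".join(out)
```

For instance, in an insurance-related question the user's name may be irrelevant while their age is decision-critical, so only the name would be masked.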
https://arxiv.org/abs/2602.10074
Human factors research has long focused on optimizing environments, tools, and systems to account for human performance. Yet, as humanoid robots begin to share our workplaces, homes, and public spaces, the design challenge expands. We must now consider not only factors for humans but also factors for humanoids, since both will coexist and interact within the same environments. Unlike conventional machines, humanoids introduce expectations of human-like behavior, communication, and social presence, which reshape usability, trust, and safety considerations. In this article, we introduce the concept of humanoid factors as a framework structured around four pillars - physical, cognitive, social, and ethical - that shape the development of humanoids to help them effectively coexist and collaborate with humans. This framework characterizes the overlap and divergence between human capabilities and those of general-purpose humanoids powered by AI foundation models. To demonstrate our framework's practical utility, we then apply the framework to evaluate a real-world humanoid control algorithm, illustrating how conventional task completion metrics in robotics overlook key human cognitive and interaction principles. We thus position humanoid factors as a foundational framework for designing, evaluating, and governing sustained human-humanoid coexistence.
https://arxiv.org/abs/2602.10069
Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
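The idea can be sketched with the standard Vendi Score definition (the exponential of the von Neumann entropy of a normalized similarity kernel). This naive eigendecomposition is only illustrative; the paper's linear-time computation is more efficient, and the cosine kernel is an assumed choice.

```python
import numpy as np

def vendi_score(X):
    """Vendi Score of a set of feature vectors (rows of X): exp of the von
    Neumann entropy of the normalized cosine-similarity kernel, i.e. the
    'effective number' of distinct items in the set."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = (Xn @ Xn.T) / len(X)          # similarity kernel scaled so trace(K) = 1
    lam = np.linalg.eigvalsh(K)
    lam = lam[lam > 1e-12]            # drop numerical zeros
    return float(np.exp(-np.sum(lam * np.log(lam))))

def vendi_novelty_score(in_dist_feats, x):
    """VNS sketch: how much adding test sample x increases the diversity
    (Vendi Score) of the in-distribution feature set."""
    return vendi_score(np.vstack([in_dist_feats, x])) - vendi_score(in_dist_feats)
```

A sample similar to the in-distribution set barely changes (or even lowers) the score, while a genuinely novel sample raises it, which is what makes the increase usable as an OOD score.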
https://arxiv.org/abs/2602.10062
Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, which are assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations, and prompting a re-examination of how controllability is approached in music generation.
https://arxiv.org/abs/2602.10058
Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.
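The core modification can be sketched as follows: standard self-attention applied to the flattened spatio-temporal token sequence, so each token attends across both space and time. This is an illustrative single-head NumPy version; the actual STA block, projections, and efficiency tricks are not shown, and all names here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(feats, Wq, Wk, Wv):
    """feats: (T, N, D) features for T frames of N spatial tokens each.
    Merges the temporal axis into the token axis, runs standard scaled
    dot-product self-attention, and restores the frame layout."""
    T, N, D = feats.shape
    x = feats.reshape(T * N, D)                 # (T*N, D) spatio-temporal sequence
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return (attn @ v).reshape(T, N, -1)
```

Because only the token sequence changes, a per-frame transformer block needs minimal modification to consume multi-frame context this way.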
https://arxiv.org/abs/2602.10052
Current instance segmentation models achieve high performance on average predictions, but lack principled uncertainty quantification: their outputs are not calibrated, and there is no guarantee that a predicted mask is close to the ground truth. To address this limitation, we introduce a conformal prediction algorithm to generate adaptive confidence sets for instance segmentation. Given an image and a pixel coordinate query, our algorithm generates a confidence set of instance predictions for that pixel, with a provable guarantee for the probability that at least one of the predictions has high Intersection-Over-Union (IoU) with the true object instance mask. We apply our algorithm to instance segmentation examples in agricultural field delineation, cell segmentation, and vehicle detection. Empirically, we find that our prediction sets vary in size based on query difficulty and attain the target coverage, outperforming existing baselines such as Learn Then Test, Conformal Risk Control, and morphological dilation-based methods. We provide versions of the algorithm with asymptotic and finite sample guarantees.
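The flavor of guarantee can be sketched with a generic split-conformal construction: calibrate a confidence cutoff so that, with probability at least 1 - alpha, the retained candidates include one that matches the true mask (IoU above some threshold tau). This is a hedged sketch of the general recipe, not the paper's exact algorithm; all names are assumptions.

```python
import math

def calibrate_cutoff(cal_scores, alpha):
    """Split-conformal cutoff. cal_scores[i] is the confidence of the best
    candidate mask that truly matches the ground truth (IoU >= tau) on
    calibration example i. Keeping test candidates with confidence >= cutoff
    then covers a matching mask with probability >= 1 - alpha (exchangeability,
    ignoring ties)."""
    n = len(cal_scores)
    k = math.floor(alpha * (n + 1))   # calibration scores allowed below the cutoff
    return sorted(cal_scores)[k - 1] if k >= 1 else -math.inf

def prediction_set(candidates, cutoff):
    """candidates: list of (mask_id, confidence); return the conformal set."""
    return [m for m, conf in candidates if conf >= cutoff]
```

Hard queries (where even the matching candidate has low confidence on calibration data) push the cutoff down, so prediction sets adaptively grow with difficulty.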
https://arxiv.org/abs/2602.10045
Head Magnetic Resonance Imaging (MRI) is routinely collected and shared for research under strict regulatory frameworks. These frameworks require removing potential identifiers before sharing. However, even after skull stripping, the brain parenchyma contains unique signatures that can match other MRIs from the same participants across databases, posing a privacy risk if additional data features are available. Current regulatory frameworks often mandate evaluating such risks based on an assessment of a certain level of reasonableness. Prior studies have already suggested that a brain MRI could enable participant linkage, but they have relied on training-based or computationally intensive methods. Here, we demonstrate that linking an individual's skull-stripped T1-weighted MRI, which may lead to re-identification if other identifiers are available, is possible using standard preprocessing followed by image similarity computation. Nearly perfect linkage accuracy was achieved in matching data samples across various time intervals, scanner types, spatial resolutions, and acquisition protocols, despite potential cognitive decline, simulating MRI matching across databases. These results aim to contribute meaningfully to the development of thoughtful, forward-looking policies in medical data sharing.
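The matching step amounts to nearest-neighbour retrieval under an image-similarity score. A toy sketch, using normalized cross-correlation of flattened, preprocessed volumes as an assumed stand-in for the paper's actual preprocessing and similarity metric:

```python
import numpy as np

def link_scans(db_a, db_b):
    """For each flattened, preprocessed scan in db_a (one per row), return the
    index of its most similar scan in db_b under normalized cross-correlation
    (centered cosine similarity)."""
    A = db_a - db_a.mean(axis=1, keepdims=True)
    B = db_b - db_b.mean(axis=1, keepdims=True)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.argmax(A @ B.T, axis=1)   # best match in db_b for each row of db_a
```

The point of the result is that a procedure this simple, with no trained re-identification model, already suffices to link scans across databases.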
https://arxiv.org/abs/2602.10043
Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
https://arxiv.org/abs/2602.10042
Forestry cranes operate in dynamic, unstructured outdoor environments where simultaneous collision avoidance and payload sway control are critical for safe navigation. Existing approaches address these challenges separately, either focusing on sway damping with predefined collision-free paths or performing collision avoidance only at the global planning level. We present the first collision-free, sway-damping model predictive controller (MPC) for a forestry crane that unifies both objectives in a single control framework. Our approach integrates LiDAR-based environment mapping directly into the MPC using online Euclidean distance fields (EDF), enabling real-time environmental adaptation. The controller simultaneously enforces collision constraints while damping payload sway, allowing it to (i) replan upon quasi-static environmental changes, (ii) maintain collision-free operation under disturbances, and (iii) provide safe stopping when no bypass exists. Experimental validation on a real forestry crane demonstrates effective sway damping and successful obstacle avoidance. A video can be found at this https URL.
https://arxiv.org/abs/2602.10035
Agents in cyber-physical systems are increasingly entrusted with safety-critical tasks. Ensuring the safety of these agents often requires localizing their pose before subsequent actions. Pose estimates can, e.g., be obtained from various combinations of lidar sensors, cameras, and external services such as GPS. Crucially, in safety-critical domains, a rough estimate is insufficient to formally determine safety, i.e., to guarantee safety even in the worst-case scenario, and external services might additionally not be trustworthy. We address this problem by presenting certified 3D pose estimation solely from a camera image and a well-known target geometry. This is realized by formally bounding the pose, which is computed by leveraging recent results from reachability analysis and formal neural network verification. Our experiments demonstrate that our approach efficiently and accurately localizes agents in both synthetic and real-world settings.
https://arxiv.org/abs/2602.10032
The principal goal of the RAG TREC Instrument for Multilingual Evaluation (RAGTIME) track at TREC is to study report generation from multilingual source documents. The track has created a document collection containing Arabic, Chinese, English, and Russian news stories. RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted for the three tasks by 13 participating teams and, as baselines, by the track coordinators. This overview describes these three tasks and presents the available results.
https://arxiv.org/abs/2602.10024
Verifying the truthfulness of claims usually requires joint multi-modal reasoning over both textual and visual evidence, such as analyzing both a textual caption and a chart image for claim verification. In addition, to make the reasoning process transparent, a textual explanation is necessary to justify the verification result. However, most claim verification works mainly focus on reasoning over textual evidence only or ignore explainability, resulting in inaccurate and unconvincing verification. To address this problem, we propose a novel model that jointly achieves evidence retrieval, multi-modal claim verification, and explanation generation. For evidence retrieval, we construct a two-layer multi-modal graph for claims and evidence, where we design image-to-text and text-to-image reasoning for multi-modal retrieval. For claim verification, we propose token- and evidence-level fusion to integrate claim and evidence embeddings for multi-modal verification. For explanation generation, we introduce multi-modal Fusion-in-Decoder for explainability. Finally, since almost all existing datasets are in the general domain, we create a scientific dataset, AIChartClaim, in the AI domain to complement the claim verification community. Experiments show the strength of our model.
https://arxiv.org/abs/2602.10023
The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at this https URL.
https://arxiv.org/abs/2602.10021
Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
https://arxiv.org/abs/2602.10017
Temporally locating and classifying fine-grained sub-task segments in long, untrimmed videos is crucial to safe human-robot collaboration. Unlike generic activity recognition, collaborative manipulation requires sub-task labels that are directly robot-executable. We present RoboSubtaskNet, a multi-stage human-to-robot sub-task segmentation framework that couples attention-enhanced I3D features (RGB plus optical flow) with a modified MS-TCN employing a Fibonacci dilation schedule to better capture short-horizon transitions such as reach-pick-place. The network is trained with a composite objective comprising cross-entropy and temporal regularizers (truncated MSE and a transition-aware term) to reduce over-segmentation and to encourage valid sub-task progressions. To close the gap between vision benchmarks and control, we introduce RoboSubtask, a dataset of healthcare and industrial demonstrations annotated at the sub-task level and designed for deterministic mapping to manipulator primitives. Empirically, RoboSubtaskNet outperforms MS-TCN and MS-TCN++ on GTEA and our RoboSubtask benchmark (boundary-sensitive and sequence metrics), while remaining competitive on the long-horizon Breakfast benchmark. Specifically, RoboSubtaskNet attains F1@50 = 79.5%, Edit = 88.6%, Acc = 78.9% on GTEA; F1@50 = 30.4%, Edit = 52.0%, Acc = 53.5% on Breakfast; and F1@50 = 94.2%, Edit = 95.6%, Acc = 92.2% on RoboSubtask. We further validate the full perception-to-execution pipeline on a 7-DoF Kinova Gen3 manipulator, achieving reliable end-to-end behavior in physical trials (overall task success approx 91.25%). These results demonstrate a practical path from sub-task level video understanding to deployed robotic manipulation in real-world settings.
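The dilation schedule can be sketched as follows. The exact starting terms are an assumption (taken here as 1, 2, 3, 5, ...); standard MS-TCN layers use the exponential schedule 1, 2, 4, 8, ....

```python
def fibonacci_dilations(num_layers):
    """Dilation factors for a stack of temporal-conv layers following a
    Fibonacci schedule (assumed to start 1, 2, 3, 5, ...), in place of the
    exponential 1, 2, 4, 8, ... used by standard MS-TCN."""
    dils, a, b = [], 1, 2
    for _ in range(num_layers):
        dils.append(a)
        a, b = b, a + b
    return dils

def receptive_field(dilations, kernel_size=3):
    """Receptive field (in frames) of stacked dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

Because the Fibonacci schedule grows more slowly, it packs more small dilations into the same depth, trading some long-range receptive field for the denser short-horizon coverage that transitions like reach-pick-place benefit from.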
https://arxiv.org/abs/2602.10015
Successfully manipulating many everyday objects, such as potato chips, requires precise force regulation. Failure to modulate force can lead to task failure or irreversible damage to the objects. Humans can precisely achieve this by adapting force from tactile feedback, even within a short period of physical contact. We aim to give robots this capability. However, commercial grippers exhibit high cost or high minimum force, making them unsuitable for studying force-controlled policy learning with everyday force-sensitive objects. We introduce TF-Gripper, a low-cost (~$150) force-controlled parallel-jaw gripper that integrates tactile sensing as feedback. It has an effective force range of 0.45-45N and is compatible with different robot arms. Additionally, we designed a teleoperation device paired with TF-Gripper to record human-applied grasping forces. While standard low-frequency policies can be trained on this data, they struggle with the reactive, contact-dependent nature of force regulation. To overcome this, we propose RETAF (REactive Tactile Adaptation of Force), a framework that decouples grasping force control from arm pose prediction. RETAF regulates force at high frequency using wrist images and tactile feedback, while a base policy predicts end-effector pose and gripper open/close action. We evaluate TF-Gripper and RETAF across five real-world tasks requiring precise force regulation. Results show that compared to position control, direct force control significantly improves grasp stability and task performance. We further show that tactile feedback is essential for force regulation, and that RETAF consistently outperforms baselines and can be integrated with various base policies. We hope this work opens a path for scaling the learning of force-controlled policies in robotic manipulation. Project page: this https URL .
https://arxiv.org/abs/2602.10013
Lane changing in dense traffic is a significant challenge for Connected and Autonomous Vehicles (CAVs). Existing lane change controllers primarily either ensure safety or collaboratively improve traffic efficiency, but do not consider these conflicting objectives together. To address this, we propose the Multi-Agent Safety Shield (MASS), designed using Control Barrier Functions (CBFs) to enable safe and collaborative lane changes. The MASS enables collaboration by capturing multi-agent interactions among CAVs through interaction topologies constructed as a graph using a simple algorithm. Further, a state-of-the-art Multi-Agent Reinforcement Learning (MARL) lane change controller is extended by integrating MASS to ensure safety and defining a customised reward function to prioritise efficiency improvements. As a result, we propose a lane change controller, known as MARL-MASS, and evaluate it in a congested on-ramp merging simulation. The results demonstrate that MASS enables collaborative lane changes with safety guarantees by strictly respecting the safety constraints. Moreover, the proposed custom reward function improves the stability of MARL policies trained with a safety shield. Overall, by encouraging the exploration of a collaborative lane change policy while respecting safety constraints, MARL-MASS effectively balances the trade-off between ensuring safety and improving traffic efficiency in congested traffic. The code for MARL-MASS is available with an open-source licence at this https URL
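The safety-shield principle can be illustrated with a minimal scalar CBF filter: project the learned (nominal) control onto the set satisfying the barrier condition. MASS itself handles multi-agent interactions over a graph, so this single-constraint, single-integrator version is only a sketch with assumed names.

```python
def cbf_safety_filter(u_nom, h, grad_h, gamma=1.0):
    """Return the control closest to u_nom satisfying the CBF condition
    grad_h * u + gamma * h >= 0 for single-integrator dynamics x_dot = u,
    where h(x) >= 0 encodes safety (e.g. headway above a minimum gap).
    With one affine constraint, the usual CBF-QP has this closed form."""
    if grad_h == 0:
        return u_nom                       # control does not affect the barrier
    bound = -gamma * h / grad_h
    return max(u_nom, bound) if grad_h > 0 else min(u_nom, bound)
```

When the MARL policy's action already satisfies the constraint it passes through unchanged; otherwise the shield clips it to the nearest safe value, which is how exploration of collaborative lane changes can proceed without violating safety.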
https://arxiv.org/abs/2602.10007
Vietnamese has a phonetic orthography, where each grapheme corresponds to at most one phoneme and vice versa. Exploiting this high grapheme-phoneme transparency, we propose ViSpeechFormer (Vietnamese Speech TransFormer), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). To the best of our knowledge, this is the first Vietnamese ASR framework that explicitly models phonemic representations. Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias. This phoneme-based paradigm is also promising for other languages with phonetic orthographies. The code will be released upon acceptance of this paper.
https://arxiv.org/abs/2602.10003
Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison. In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation. The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5x faster training while maintaining visual quality, establishing a new cost-effective and resource-efficient baseline for 3DGS optimization. Furthermore, we demonstrate that the optimizations can be applied to 4D Gaussian reconstruction, leading to efficient non-rigid scene optimization.
https://arxiv.org/abs/2602.09999
How can children acquire native-level syntax from limited input? According to the Poverty of the Stimulus Hypothesis (PoSH), the linguistic input children receive is insufficient to explain certain generalizations that are robustly learned; innate linguistic constraints, many have argued, are thus necessary to explain language learning. Neural language models, which lack such language-specific constraints in their design, offer a computational test of this longstanding (but controversial) claim. We introduce PoSHBench, a training-and-evaluation suite targeting question formation, islands to movement, and other English phenomena at the center of the PoSH arguments. Training Transformer models on 10-50M words of developmentally plausible text, we find indications of generalization on all phenomena even without direct positive evidence -- yet neural models remain less data-efficient and their generalizations are weaker than those of children. We further enhance our models with three recently proposed cognitively motivated inductive biases. We find these biases improve general syntactic competence but not PoSHBench performance. Our findings challenge the claim that innate syntax is the only possible route to generalization, while suggesting that human-like data efficiency requires inductive biases beyond those tested here.
https://arxiv.org/abs/2602.09992