Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is constrained by their limited ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We also present baselines that integrate state-of-the-art VLMs with low-level navigation policies, assessing their performance on these memory-intensive tasks and highlighting areas for improvement.
https://arxiv.org/abs/2506.15635
In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
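The abstract does not spell out how ordered coverage is computed. A minimal sketch of one plausible reading, the fraction of concepts whose first mentions appear in the prescribed order, is given below; the function name and word-boundary matching are assumptions, not the paper's definition.

```python
import re

def ordered_coverage(concepts, sentence):
    """Fraction of concepts whose first occurrences in the generated
    sentence respect the prescribed order (one plausible reading of
    the metric, not necessarily the paper's exact definition)."""
    positions = []
    for c in concepts:
        m = re.search(r"\b" + re.escape(c) + r"\b", sentence.lower())
        positions.append(m.start() if m else None)
    covered, last = 0, -1
    for p in positions:
        if p is not None and p > last:
            covered, last = covered + 1, p
    return covered / len(concepts)

print(ordered_coverage(["dog", "ball", "park"],
                       "A dog chased a ball across the park."))  # 1.0
print(ordered_coverage(["ball", "dog", "park"],
                       "A dog chased a ball across the park."))  # 2/3
```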
https://arxiv.org/abs/2506.15629
$\textbf{Objective:}$ Brain-predicted age difference (BrainAGE) is a neuroimaging biomarker reflecting brain health. However, training robust BrainAGE models requires large datasets, often restricted by privacy concerns. This study evaluates the performance of federated learning (FL) for BrainAGE estimation in ischemic stroke patients treated with mechanical thrombectomy, and investigates its association with clinical phenotypes and functional outcomes. $\textbf{Methods:}$ We used FLAIR brain images from 1674 stroke patients across 16 hospital centers. We implemented standard machine learning and deep learning models for BrainAGE estimates under three data management strategies: centralized learning (pooled data), FL (local training at each site), and single-site learning. We reported prediction errors and examined associations between BrainAGE and vascular risk factors (e.g., diabetes mellitus, hypertension, smoking), as well as functional outcomes at three months post-stroke. Logistic regression evaluated BrainAGE's predictive value for these outcomes, adjusting for age, sex, vascular risk factors, stroke severity, time between MRI and arterial puncture, prior intravenous thrombolysis, and recanalisation outcome. $\textbf{Results:}$ While centralized learning yielded the most accurate predictions, FL consistently outperformed single-site models. BrainAGE was significantly higher in patients with diabetes mellitus across all models. Comparisons between patients with good and poor functional outcomes, and multivariate predictions of these outcomes showed the significance of the association between BrainAGE and post-stroke recovery. $\textbf{Conclusion:}$ FL enables accurate age predictions without data centralization. The strong association between BrainAGE, vascular risk factors, and post-stroke recovery highlights its potential for prognostic modeling in stroke care.
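The abstract does not name the aggregation rule used for FL. A minimal sketch of federated averaging (FedAvg), the standard choice for this kind of multi-site training, is shown below; the sample-count weighting and the `local_train` callback are assumptions.

```python
import numpy as np

def fedavg_round(global_weights, sites, local_train):
    """One communication round: every hospital trains locally on its own
    FLAIR data, and the server averages the returned weights in
    proportion to site size. `local_train(weights, data)` stands in for
    whichever BrainAGE regressor (classical ML or deep) a site fits."""
    updates, sizes = [], []
    for data in sites:
        local = [np.copy(w) for w in global_weights]
        updates.append(local_train(local, data))  # local epochs; no data leaves the site
        sizes.append(len(data))
    total = float(sum(sizes))
    return [sum((n / total) * u[i] for u, n in zip(updates, sizes))
            for i in range(len(global_weights))]
```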
https://arxiv.org/abs/2506.15626
We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it demands strict contact accuracy alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible thanks to our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. this https URL.
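DNO is the core optimization primitive here. A minimal sketch of the loop, optimizing the initial noise through a frozen, differentiable sampler, assuming PyTorch and user-supplied `sample_fn` and `loss_fn` (e.g., a hand-object contact objective); the paper's phase-specific objectives are not reproduced:

```python
import torch

def diffusion_noise_optimization(sample_fn, loss_fn, shape,
                                 steps=100, lr=0.05):
    """Optimize the initial diffusion noise z so that the frozen
    sampler's output minimizes a task loss. `sample_fn` must be
    differentiable end to end; its internals are assumed here."""
    z = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        motion = sample_fn(z)        # frozen pretrained diffusion model
        loss = loss_fn(motion)       # e.g., contact / plausibility objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```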
https://arxiv.org/abs/2506.15625
Large Language Models (LLMs) have shown promise as decision-makers in dynamic settings, but their stateless nature necessitates creating a natural language representation of history. We present a unifying framework for systematically constructing natural language "state" representations for prompting LLM agents in repeated multi-agent games. Previous work on games with LLM agents has taken an ad hoc approach to encoding game history, which not only obscures the impact of state representation on agents' behavior, but also limits comparability between studies. Our framework addresses these gaps by characterizing methods of state representation along three axes: action informativeness (i.e., the extent to which the state representation captures actions played); reward informativeness (i.e., the extent to which the state representation describes rewards obtained); and prompting style (or natural language compression, i.e., the extent to which the full text history is summarized). We apply this framework to a dynamic selfish routing game, chosen because it admits a simple equilibrium both in theory and in human subject experiments \cite{rapoport_choice_2009}. Despite the game's relative simplicity, we find that there are key dependencies of LLM agent behavior on the natural language state representation. In particular, we observe that representations which provide agents with (1) summarized, rather than complete, natural language representations of past history; (2) information about regrets, rather than raw payoffs; and (3) limited information about others' actions lead to behavior that more closely matches game theoretic equilibrium predictions, and with more stable game play by the agents. By contrast, other representations can exhibit either large deviations from equilibrium, higher variation in dynamic game play over time, or both.
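A sketch of how the three axes could parameterize a state-prompt builder; the field names, wording, and the regret formula (best fixed payoff minus realized payoff) are illustrative assumptions, not the paper's exact templates:

```python
def build_state_prompt(history, action_info="own_only", reward_info="regret",
                       style="summarized", k=5):
    """Render repeated-game history as a natural-language state along the
    three axes: action informativeness, reward informativeness, and
    prompting style (degree of compression)."""
    rounds = history if style == "full" else history[-k:]
    lines = []
    for t, r in enumerate(rounds, 1):
        parts = [f"Round {t}: you chose {r['own_action']}"]
        if action_info == "full":
            parts.append(f"others chose {r['other_actions']}")
        if reward_info == "payoff":
            parts.append(f"payoff {r['payoff']}")
        elif reward_info == "regret":
            parts.append(f"regret {r['best_payoff'] - r['payoff']}")
        lines.append(", ".join(parts) + ".")
    if style == "summarized":
        lines.insert(0, f"Summary of the last {len(rounds)} rounds:")
    return "\n".join(lines)
```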
https://arxiv.org/abs/2506.15624
Misunderstandings in cross-cultural communication often arise from subtle differences in interpretation, but it is unclear whether these differences stem from the literal meanings assigned to words or from more general pragmatic factors such as norms around politeness and brevity. In this paper, we report three experiments examining how speakers of British and American English interpret intensifiers like "quite" and "very." To better understand these cross-cultural differences, we developed a computational cognitive model where listeners recursively reason about speakers who balance informativity, politeness, and utterance cost. Our model comparisons suggested that cross-cultural differences in intensifier interpretation stem from a combination of (1) different literal meanings and (2) different weights on utterance cost. These findings challenge accounts based purely on semantic variation or politeness norms, demonstrating that cross-cultural differences in interpretation emerge from an intricate interplay between the two.
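A minimal numpy sketch of the model family described: a literal listener, a speaker who trades off informativity, politeness, and utterance cost, and a pragmatic listener who inverts the speaker. The toy lexicon, the politeness term, and every weight are illustrative assumptions; the two dialects would be modeled by swapping in different lexicons and cost weights.

```python
import numpy as np

states = [0.5, 0.7, 0.9]                 # how good the referent actually is
utterances = ["quite", "very"]
lexicon = np.array([[1.0, 1.0, 0.3],     # literal fit of "quite" (toy values)
                    [0.1, 0.7, 1.0]])    # literal fit of "very"
cost = np.array([0.1, 0.2])              # utterance costs (toy values)

def normalize(p, axis):
    return p / p.sum(axis=axis, keepdims=True)

def L0(lex):                  # literal listener: P(state | utterance)
    return normalize(lex, axis=1)

def S1(lex, alpha=3.0, w_polite=0.5, w_cost=1.0):
    # speaker utility = informativity + politeness - cost
    informativity = np.log(L0(lex).T + 1e-9)        # (state, utterance)
    politeness = np.array(states)[:, None]          # value of saying nicer things
    util = informativity + w_polite * politeness - w_cost * cost[None, :]
    return normalize(np.exp(alpha * util), axis=1)  # P(utterance | state)

def L1(lex, **kw):            # pragmatic listener reasons about the speaker
    return normalize(S1(lex, **kw).T, axis=1)

print(L1(lexicon))            # rows: P(state | "quite"), P(state | "very")
```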
https://arxiv.org/abs/2506.15623
Fairness in machine learning (ML) is of critical importance for building trustworthy ML systems as artificial intelligence (AI) systems increasingly impact various aspects of society, including healthcare decisions and legal judgments. Moreover, numerous studies demonstrate evidence of unfair outcomes in ML and the need for more robust fairness-aware methods. However, the data we use to train and develop debiasing techniques often contains biased and noisy labels. As a result, label bias in the training data affects model performance and misrepresents the fairness of classifiers during testing. To tackle this problem, our paper presents Graph-based Fairness-aware Label Correction (GFLC), an efficient method for correcting label noise while preserving demographic parity in datasets. In particular, our approach combines three key components: a prediction confidence measure, graph-based regularization through Ricci-flow-optimized graph Laplacians, and explicit demographic parity incentives. Our experimental findings show the effectiveness of our proposed approach, with significant improvements in the trade-off between performance and fairness metrics compared to the baseline.
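A hedged sketch of how the three components might combine into a per-sample noisy-label score for binary labels. A plain graph Laplacian stands in for the Ricci-flow-optimized one, and the additive form and weights are assumptions:

```python
import numpy as np

def gflc_flip_scores(y_noisy, p_model, L, groups,
                     lam_graph=1.0, lam_fair=1.0):
    """Evidence that each binary label is noisy, combining low model
    confidence in the observed label, disagreement with graph neighbors,
    and a bonus for flips that shrink the demographic-parity gap.
    p_model: (n, 2) predicted class probabilities; L: graph Laplacian."""
    n = len(y_noisy)
    distrust = 1.0 - p_model[np.arange(n), y_noisy]
    s = 2.0 * y_noisy - 1.0
    disagreement = (L @ s) * s          # = 2 x (weighted disagreeing neighbors)
    def dp_gap(y):
        rates = [y[groups == g].mean() for g in np.unique(groups)]
        return max(rates) - min(rates)
    base = dp_gap(y_noisy)
    fair = np.empty(n)
    for i in range(n):
        y_try = y_noisy.copy()
        y_try[i] = 1 - y_try[i]
        fair[i] = base - dp_gap(y_try)  # positive if the flip improves parity
    return distrust + lam_graph * disagreement + lam_fair * fair
```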
https://arxiv.org/abs/2506.15620
Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then analyze their internal representation. This analysis requires examining the model's hidden states, where information processing occurs at the neuron level. However, this faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics to find the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. Addressing these limitations, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) metric to identify optimal regret representation layers, and (3) the Regret Dominance Score (RDS) metric to identify regret neurons and the Group Impact Coefficient (GIC) to analyze activation patterns. Our experimental results successfully identified the optimal regret representation layer using the S-CDI metric, which significantly enhanced performance in probe classification experiments. Additionally, we discovered an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. Through the RDS metric, we categorized neurons into three distinct functional groups: regret neurons, non-regret neurons, and dual neurons.
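The S-CDI formula cannot be reconstructed from the abstract. As a generic stand-in, layer selection of this kind is typically done by probing every layer's pooled hidden states, as sketched below with cross-validated linear-probe accuracy (scikit-learn assumed) in place of the paper's metric:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def best_regret_layer(hidden_states, labels):
    """hidden_states: (n_layers, n_examples, d_model) pooled activations
    for regret vs. non-regret outputs; returns the layer whose states
    are most linearly separable. The paper's S-CDI metric would replace
    the probe accuracy used here."""
    scores = []
    for layer_acts in hidden_states:
        probe = LogisticRegression(max_iter=1000)
        scores.append(cross_val_score(probe, layer_acts, labels, cv=5).mean())
    return int(np.argmax(scores)), scores
```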
https://arxiv.org/abs/2506.15617
This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data, often referred to as ``dark data,'' such as online videos. Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-based data selection techniques, they often overlook two critical aspects: (1) the inherent robustness of modern TTS models to noise, and (2) the potential contribution of perceptually low-quality yet informative samples. To address these issues, TTSOps introduces a data-centric training pipeline that integrates three core components: (1) automated data collection from dark data sources, (2) utterance-level dynamic selection of data cleansing methods based on training data quality, and (3) evaluation-in-the-loop data selection using automatically predicted mean opinion scores (MOS) to estimate each utterance's impact on model performance. Furthermore, TTSOps jointly optimizes the corpus and the TTS model in a closed-loop framework by dynamically adapting both data selection and data cleansing processes to the characteristics of the target TTS model. Extensive experiments on Japanese YouTube data demonstrate that TTSOps outperforms conventional acoustic-quality-based baselines in both the naturalness and speaker diversity of synthesized speech.
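A sketch of the evaluation-in-the-loop idea under stated assumptions: an automatic MOS predictor scores each utterance's cleansing variants, the best variant is kept per utterance, and the lowest-scoring utterances are dropped before training. The paper's estimate of per-utterance impact on the final model is likely more involved than this ranking:

```python
def select_and_train(utterances, predict_mos, train_tts, keep_ratio=0.8):
    """`predict_mos(wav)` is an automatic MOS predictor (a UTMOS-style
    model, assumed); each utterance carries several cleansed variants of
    the same clip. Both the per-utterance variant choice and the global
    keep-ratio cut are assumptions about the pipeline."""
    scored = []
    for utt in utterances:
        best = max(utt["variants"], key=predict_mos)  # dynamic cleansing choice
        scored.append((predict_mos(best), utt["text"], best))
    scored.sort(key=lambda t: t[0], reverse=True)     # evaluation in the loop
    kept = scored[:int(keep_ratio * len(scored))]
    return train_tts([(text, wav) for _, text, wav in kept])
```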
https://arxiv.org/abs/2506.15614
Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection via bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified set, we employ an association module for multi-view correspondences and an optimization module to fuse the 3D bounding boxes of the same instance predicted in multiple views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on the ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
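A minimal sketch of the association step's geometric core: axis-aligned 3D IoU plus greedy NMS. The paper's boxes are oriented and its matching is more elaborate, so treat this as a simplification:

```python
import numpy as np

def iou_3d(a, b):
    """Axis-aligned 3D IoU; boxes are (x1, y1, z1, x2, y2, z2)."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol = lambda box: np.prod(box[3:] - box[:3])
    return inter / (vol(a) + vol(b) - inter + 1e-9)

def nms_3d(boxes, scores, thr=0.25):
    """Greedy 3D NMS, here standing in for cross-view box association."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou_3d(boxes[i], boxes[j]) < thr]
    return keep
```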
https://arxiv.org/abs/2506.15610
Task-Oriented Grasping (TOG) presents a significant challenge, requiring a nuanced understanding of task semantics, object affordances, and the functional constraints dictating how an object should be grasped for a specific task. To address these challenges, we introduce GRIM (Grasp Re-alignment via Iterative Matching), a novel training-free framework for task-oriented grasping. Initially, a coarse alignment strategy is developed using a combination of geometric cues and principal component analysis (PCA)-reduced DINO features for similarity scoring. Subsequently, the full grasp pose associated with the retrieved memory instance is transferred to the aligned scene object and further refined against a set of task-agnostic, geometrically stable grasps generated for the scene object, prioritizing task compatibility. In contrast to existing learning-based methods, GRIM demonstrates strong generalization capabilities, achieving robust performance with only a small number of conditioning examples.
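A sketch of the coarse similarity scoring under stated assumptions (mean-pooled patch features, 32 PCA components, cosine similarity); the paper additionally mixes in geometric cues:

```python
import numpy as np
from sklearn.decomposition import PCA

def coarse_similarity(scene_feats, memory_feats, n_components=32):
    """Cosine similarity between PCA-reduced DINO patch features of a
    scene object and a retrieved memory instance. Pooling, dimensions,
    and fitting PCA on the concatenated features are assumptions."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(np.vstack([scene_feats, memory_feats]))
    a = reduced[:len(scene_feats)].mean(axis=0)
    b = reduced[len(scene_feats):].mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
```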
https://arxiv.org/abs/2506.15607
Large Language Models (LLMs) have become indispensable in real-world applications. However, their widespread adoption raises significant safety concerns, particularly in responding to socially harmful questions. Despite substantial efforts to improve model safety through alignment, aligned models can still have their safety protections undermined by subsequent fine-tuning - even when the additional training data appears benign. In this paper, we empirically demonstrate that this vulnerability stems from the sensitivity of safety-critical low-rank subspaces in LLM parameters to fine-tuning. Building on this insight, we propose a novel training-free method, termed Low-Rank Extrapolation (LoX), to enhance safety robustness by extrapolating the safety subspace of an aligned LLM. Our experimental results confirm the effectiveness of LoX, demonstrating significant improvements in robustness against both benign and malicious fine-tuning attacks while preserving the model's adaptability to new tasks. For instance, LoX leads to 11% to 54% absolute reductions in attack success rates (ASR) facing benign or malicious fine-tuning attacks. By investigating the ASR landscape of parameters, we attribute the success of LoX to that the extrapolation moves LLM parameters to a flatter zone, thereby less sensitive to perturbations. The code is available at this http URL.
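One plausible reading of low-rank extrapolation, sketched per weight matrix: take the top singular directions of the alignment delta as the safety subspace and push the weights further along them. The rank, the scale, and the delta-SVD construction are assumptions, not the released code:

```python
import torch

def lox_extrapolate(w_base, w_aligned, k=8, alpha=0.5):
    """Extrapolate an aligned weight matrix along the top-k singular
    directions of the alignment update (the presumed safety subspace)."""
    delta = w_aligned - w_base
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    safety = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]
    return w_aligned + alpha * safety   # move past the aligned point
```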
https://arxiv.org/abs/2506.15606
While multiple-choice questions (MCQs) are valuable for learning and evaluation, manually creating them with varying difficulty levels and targeted reading skills remains a time-consuming and costly task. Recent advances in generative AI provide an opportunity to automate MCQ generation efficiently. However, assessing the actual quality and reliability of generated MCQs has received limited attention -- particularly regarding cases where generation fails. This aspect becomes particularly important when the generated MCQs are meant to be applied in real-world settings. Additionally, most MCQ generation studies focus on English, leaving other languages underexplored. This paper investigates the capabilities of current generative models in producing MCQs for reading comprehension in Portuguese, a morphologically rich language. Our study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. We evaluate these MCQs through expert review and by analyzing the psychometric properties extracted from student responses to assess their suitability for elementary school students. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability. Also, challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design.
https://arxiv.org/abs/2506.15598
In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to standard structural scans, conventional unsupervised DIR methods struggle to learn reliable spatial mappings and often distort images. We find that the similarity metrics guiding these models fail to capture alignment between highly disparate modalities. To address this, we propose M2M-Reg (Multi-to-Mono Registration), a novel framework that trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm for seamless integration into existing models. We also introduce GradCyCon, a regularizer that leverages M2M-Reg's cyclic training scheme to promote diffeomorphism. Furthermore, our framework naturally extends to a semi-supervised setting, integrating pre-aligned and unaligned pairs only, without requiring ground-truth transformations or segmentation masks. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, highlighting its effectiveness in handling highly heterogeneous multi-modal DIR. Our code is available at this https URL.
https://arxiv.org/abs/2506.15596
Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.
https://arxiv.org/abs/2506.15594
It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at this https URL.
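A sketch of the inference-time merge for a single weight matrix, using the standard additive LoRA convention (assumed here rather than taken from the paper's code):

```python
import torch

def merge_dual_lora(w_sd, lora_c, lora_d, scale_c=1.0, scale_d=1.0):
    """Fold the Consistency-LoRA and Detail-LoRA branches into a frozen
    SD weight for one-step inference. Each branch is a (down, up) pair
    with shapes (r, in) and (out, r); the scales are assumptions."""
    down_c, up_c = lora_c
    down_d, up_d = lora_d
    return w_sd + scale_c * (up_c @ down_c) + scale_d * (up_d @ down_d)
```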
https://arxiv.org/abs/2506.15591
Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: this https URL
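A sketch of the draft-then-edit loop with two Flan-T5-Base models via Hugging Face transformers. The prompts, the ADD/DELETE edit format, and zero-shot prompting are assumptions; the paper's models are fine-tuned on DiscoSG-DS:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
drafter = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
refiner = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def generate(model, prompt, max_new_tokens=256):
    ids = tok(prompt, return_tensors="pt", truncation=True).input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    return tok.decode(out[0], skip_special_tokens=True)

def apply_edits(graph, edits):
    # One "ADD <triple>" or "DELETE <triple>" per line; the edit
    # grammar is an assumption for this sketch.
    triples = [t.strip() for t in graph.split(";") if t.strip()]
    for line in edits.splitlines():
        if line.startswith("ADD "):
            triples.append(line[4:].strip())
        elif line.startswith("DELETE ") and line[7:].strip() in triples:
            triples.remove(line[7:].strip())
    return "; ".join(triples)

def discosg_refine(caption, n_rounds=3):
    """Draft a scene graph with one small PLM, then let a second PLM
    propose edits for a few rounds instead of regenerating the graph."""
    graph = generate(drafter, f"Parse into scene-graph triples: {caption}")
    for _ in range(n_rounds):
        edits = generate(refiner, f"Caption: {caption}\nGraph: {graph}\n"
                                  "List ADD/DELETE edits to fix the graph:")
        graph = apply_edits(graph, edits)
    return graph
```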
https://arxiv.org/abs/2506.15583
Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Quantitative Structural Model (QSM) offers a non-destructive approach to AGB estimation through 3D tree structural reconstruction. However, current QSM methods face significant limitations, as they are primarily designed for individual trees, depend on high-quality point cloud data from terrestrial laser scanning (TLS), and also require multiple pre-processing steps that hinder scalability and practical deployment. This study presents a novel unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The proposed approach seamlessly integrates tree segmentation, leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. Comprehensive validation was conducted on datasets with varying leaf conditions (leaf-on and leaf-off), spatial scales (tree- and plot-level), and data sources (TLS and UAV-based laser scanning, ULS). Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). These findings indicate that the proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. To our knowledge, this is the first method capable of enabling seamless, end-to-end 3D tree reconstruction at operational scales. This advancement substantially improves the feasibility of QSM-based AGB estimation, paving the way for broader applications in forest inventory and climate change research.
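A generic sketch of graph pathing for skeletonization in the spirit of the pipeline: build a k-NN graph over (wood) points, compute geodesic distances from a root point, and take per-bin centroids as skeleton nodes. The parameters and the binning rule are assumptions, not the paper's operations:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra
from scipy.spatial import cKDTree

def skeleton_by_graph_pathing(points, root_idx, k=10, bin_size=0.2):
    """points: (n, 3) wood points; returns approximate skeleton nodes
    as centroids of geodesic-distance bins measured from the root."""
    dist, idx = cKDTree(points).query(points, k=k + 1)
    rows = np.repeat(np.arange(len(points)), k)
    graph = csr_matrix((dist[:, 1:].ravel(), (rows, idx[:, 1:].ravel())),
                       shape=(len(points), len(points)))
    geo = dijkstra(graph, directed=False, indices=root_idx)
    finite = np.isfinite(geo)               # skip disconnected points
    bins = np.full(len(points), -1)
    bins[finite] = np.floor(geo[finite] / bin_size).astype(int)
    return np.array([points[bins == b].mean(axis=0)
                     for b in np.unique(bins[bins >= 0])])
```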
https://arxiv.org/abs/2506.15577
We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG), and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.
https://arxiv.org/abs/2506.15569
We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs' gender inclusivity. Our study highlights the importance of improving LLMs' inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models.
https://arxiv.org/abs/2506.15568