Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel in basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich, multi-level textual representations; (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity; and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Our framework, with the lightweight GPT-4o-mini model, achieves SOTA performance against 15+ MLLMs on the English image implication benchmark and a substantial improvement on the Chinese benchmark, performing comparably with the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at this https URL.
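To make the three-stage design concrete, here is a minimal sketch of a perception-search-reasoning loop; `call_llm` and `web_search` are assumed generic helpers, not the released LAD implementation.

```python
# Minimal sketch of the perception -> search -> reasoning loop described above.
# `call_llm` and `web_search` are assumed generic helpers, not the released LAD code.
from typing import List

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat model, e.g. GPT-4o-mini")

def web_search(query: str) -> str:
    raise NotImplementedError("plug in any search backend")

def understand_implication(image_caption: str, max_rounds: int = 3) -> str:
    # (1) Perception: convert visual content into rich, multi-level text.
    description = call_llm(
        "Describe the surface content, salient symbols, and visual style of this image:\n"
        + image_caption
    )
    # (2) Search: iteratively query cross-domain knowledge to fill contextual gaps.
    context: List[str] = []
    for _ in range(max_rounds):
        query = call_llm(
            f"Description:\n{description}\nKnown background:\n{context}\n"
            "Name one missing piece of cultural or factual background, or answer NONE."
        )
        if query.strip().upper() == "NONE":
            break
        context.append(web_search(query))
    # (3) Reasoning: produce a context-aligned implication with explicit reasoning.
    return call_llm(
        f"Description:\n{description}\nBackground:\n{context}\n"
        "Reason step by step, then state the image's implied meaning."
    )
```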
https://arxiv.org/abs/2505.17019
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from 11 other existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.
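A rough sketch of the Plan-Execute paradigm with specialized spatial tools follows; the tool names and the `call_llm` helper are illustrative assumptions, not the released SpatialAgent interfaces.

```python
# Rough Plan-Execute sketch for a tool-using spatial agent. The tool names and the
# `call_llm` helper are illustrative assumptions, not the released SpatialAgent interfaces.
from typing import Callable, Dict, List

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat model")

TOOLS: Dict[str, Callable[[str], str]] = {
    "estimate_depth": lambda image: "depth statistics ...",          # placeholder tool outputs
    "estimate_camera_pose": lambda image: "rotation R, translation t ...",
    "detect_objects_3d": lambda image: "3D bounding boxes ...",
}

def plan_and_execute(question: str, image_path: str) -> str:
    # Plan: let the LLM pick an ordered subset of tools for the question.
    plan = call_llm(
        f"Question: {question}\nAvailable tools: {sorted(TOOLS)}\n"
        "List one tool name per line, in execution order."
    )
    # Execute: run each planned tool and collect observations.
    observations: List[str] = []
    for name in plan.splitlines():
        tool = TOOLS.get(name.strip())
        if tool is not None:
            observations.append(f"{name.strip()}: {tool(image_path)}")
    # Answer: reason over the accumulated observations.
    return call_llm(f"Question: {question}\nObservations:\n" + "\n".join(observations))
```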
https://arxiv.org/abs/2505.17012
LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4\% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47\% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.
https://arxiv.org/abs/2505.16997
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides every instance with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45\% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enabled a 7B model to perform comparably to GPT-4o on the \textit{hard} split, underscoring the value of its high-quality training data. Code is available here \href{this https URL}{this https URL}.
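As a small sketch of how executable unit tests can supply an RL reward signal, the snippet below scores a patched repository by its test pass rate; the pytest invocation and pass-rate formulation are assumptions rather than the exact SWE-Dev harness.

```python
# Sketch of deriving an RL reward from developer-authored unit tests: apply the model's patch
# in the runnable environment, run the tests, and reward the pass rate. The pytest invocation
# and pass-rate formulation are assumptions, not the exact SWE-Dev harness.
import subprocess
from typing import List

def unit_test_reward(repo_dir: str, test_paths: List[str], timeout: int = 300) -> float:
    """Fraction of test files that pass inside the task's runnable environment."""
    if not test_paths:
        return 0.0
    passed = 0
    for test in test_paths:
        result = subprocess.run(
            ["python", "-m", "pytest", test, "-q"],
            cwd=repo_dir, capture_output=True, timeout=timeout,
        )
        passed += int(result.returncode == 0)  # pytest exits 0 only if every test passed
    return passed / len(test_paths)
```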
https://arxiv.org/abs/2505.16975
Existing medical VQA benchmarks mostly focus on single-image analysis, yet clinicians almost always compare a series of images before reaching a diagnosis. To better approximate this workflow, we introduce MedFrameQA -- the first benchmark that explicitly evaluates multi-image reasoning in medical VQA. To build MedFrameQA at scale and with high quality, we develop 1) an automated pipeline that extracts temporally coherent frames from medical videos and constructs VQA items whose content evolves logically across images, and 2) a multiple-stage filtering strategy, including model-based and manual review, to preserve data clarity, difficulty, and medical relevance. The resulting dataset comprises 2,851 VQA pairs (gathered from 9,237 high-quality frames in 3,420 videos), covering nine human body systems and 43 organs; every question is accompanied by two to five images. We comprehensively benchmark ten advanced Multimodal LLMs -- both proprietary and open source, with and without explicit reasoning modules -- on MedFrameQA. The evaluation reveals that all models perform poorly, with most accuracies below 50%, and that accuracy fluctuates as the number of images per question increases. Error analysis further shows that models frequently ignore salient findings, mis-aggregate evidence across images, and propagate early mistakes through their reasoning chains; results also vary substantially across body systems, organs, and modalities. We hope this work can catalyze research on clinically grounded, multi-image reasoning and accelerate progress toward more capable diagnostic AI systems.
https://arxiv.org/abs/2505.16964
Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
https://arxiv.org/abs/2505.16931
We introduce $\infty$-THOR, a new framework for long-horizon embodied tasks that advances long-context understanding in embodied AI. $\infty$-THOR provides: (1) a generation framework for synthesizing scalable, reproducible, and unlimited long-horizon trajectories; (2) a novel embodied QA task, Needle(s) in the Embodied Haystack, where multiple scattered clues across extended trajectories test agents' long-context reasoning ability; and (3) a long-horizon dataset and benchmark suite featuring complex tasks that span hundreds of environment steps, each paired with ground-truth action sequences. To enable this capability, we explore architectural adaptations, including interleaved Goal-State-Action modeling, context extension techniques, and Context Parallelism, to equip LLM-based agents for extreme long-context reasoning and interaction. Experimental results and analyses highlight the challenges posed by our benchmark and provide insights into training strategies and model behaviors under long-horizon conditions. Our work provides a foundation for the next generation of embodied AI systems capable of robust, long-term reasoning and planning.
https://arxiv.org/abs/2505.16928
Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs' ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE spans five domains and comprises 4k long-form QA instances and over 20k short-form QA pairs. Our dataset is the first to directly bridge short- and long-form QA with paired questions and gold-standard answers. Along with the benchmark, we propose a suite of new metrics to assess the models' capabilities to selectively express uncertainty. Using UNCLE, we then demonstrate that current models fail to convey uncertainty appropriately in long-form generation. We further explore both prompt-based and training-based methods to improve models' performance, with the training-based methods yielding greater gains. Further analysis of alignment gaps between short- and long-form uncertainty expression highlights promising directions for future research using UNCLE.
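As a generic illustration (not the paper's proposed metrics), one simple way to quantify whether a model expresses uncertainty selectively is to check that hedges coincide with wrong answers:

```python
# Generic illustration (not the paper's proposed metrics) of scoring *selective* uncertainty:
# hedges should appear on questions the model gets wrong, not on ones it gets right.
from typing import Iterable, Tuple

HEDGES = ("i'm not sure", "i am not sure", "uncertain", "i don't know", "cannot be determined")

def selective_uncertainty_scores(records: Iterable[Tuple[str, bool]]) -> dict:
    """records: (answer_text, is_correct) pairs."""
    hedged = [(any(h in ans.lower() for h in HEDGES), ok) for ans, ok in records]
    wrong = [h for h, ok in hedged if not ok]           # was the model hedging when wrong?
    flagged = [ok for h, ok in hedged if h]             # was it actually wrong when hedging?
    recall = sum(wrong) / len(wrong) if wrong else 0.0
    precision = sum(not ok for ok in flagged) / len(flagged) if flagged else 0.0
    return {"uncertainty_recall": recall, "uncertainty_precision": precision}
```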
https://arxiv.org/abs/2505.16922
Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.
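Below is a minimal sketch of the four drift measures named above, computed between a reference round and the current titration round; the mean-pooled hidden states and flattened attention distributions are assumptions about the setup, not the paper's exact extraction.

```python
# Minimal sketch of the four drift measures (cosine, entropy, JS, Spearman) between a reference
# round and the current titration round. Pooled hidden states and flattened attention
# distributions are assumptions about the setup, not the paper's exact extraction.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy, spearmanr

def drift_metrics(h_ref: np.ndarray, h_cur: np.ndarray,
                  a_ref: np.ndarray, a_cur: np.ndarray) -> dict:
    """h_*: pooled hidden states, shape (d,); a_*: attention distributions, shape (n,)."""
    cos_drift = 1.0 - float(h_ref @ h_cur / (np.linalg.norm(h_ref) * np.linalg.norm(h_cur)))
    ent_drift = float(entropy(a_cur) - entropy(a_ref))    # change in attention entropy
    js_drift = float(jensenshannon(a_ref, a_cur) ** 2)    # Jensen-Shannon divergence
    rho, _ = spearmanr(a_ref, a_cur)                      # rank agreement of attention weights
    return {"cosine": cos_drift, "entropy": ent_drift, "js": js_drift, "spearman": float(rho)}
```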
https://arxiv.org/abs/2505.16894
Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.
https://arxiv.org/abs/2505.16875
Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches -- instance-level and token-level -- either sacrifice essential local reasoning signals like reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments the Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select a short and coherent sequence. Experiments with Qwen2.5-Instruct models on MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, with only a 0.6% drop compared to the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at this https URL
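A sketch of the two-stage chunk-level compression idea follows; `call_llm` and the coherence-score stub are assumptions, not the released R1-Compress code.

```python
# Sketch of chunk-level CoT compression: inner-chunk LLM compression, then an inter-chunk
# search for a short, coherent continuation. `call_llm` and the scoring stub are assumptions.
from typing import List

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat model")

def coherence_score(prefix: str, candidate: str) -> float:
    # Stand-in scoring function (lower is better); a real system might use perplexity of
    # `prefix + candidate` under the target model. Here we simply prefer shorter candidates.
    return float(len(candidate))

def compress_long_cot(cot: str, chunk_size: int = 512, n_candidates: int = 3) -> str:
    # 1) Segment the Long-CoT into manageable chunks.
    chunks = [cot[i:i + chunk_size] for i in range(0, len(cot), chunk_size)]
    compressed: List[str] = []
    for chunk in chunks:
        # 2) Inner-chunk compression: ask the LLM for several shortened rewrites.
        options = [
            call_llm(f"Rewrite this reasoning step more concisely, keeping reflection steps:\n{chunk}")
            for _ in range(n_candidates)
        ]
        # 3) Inter-chunk search: keep the candidate that best continues the compressed prefix.
        prefix = " ".join(compressed)
        compressed.append(min(options, key=lambda o: coherence_score(prefix, o)))
    return " ".join(compressed)
```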
https://arxiv.org/abs/2505.16838
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various real-world distortions limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, no IQA method assesses the usability of an image in embodied tasks, namely, its perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) constructed a perception-cognition-decision-execution pipeline based on the Mertonian system and meta-cognitive theory, and defined a comprehensive subjective score collection process; (2) established the Embodied-IQA database, containing over 36k reference/distorted image pairs with more than 5m fine-grained annotations provided by Vision-Language Models, Vision-Language-Action models, and real-world robots; (3) trained and validated the performance of mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that, through this evaluation, we can promote the application of Embodied AI under complex real-world distortions. Project page: this https URL
https://arxiv.org/abs/2505.16815
Batteries are essential for various applications, including electric vehicles and renewable energy storage, making safety and efficiency critical concerns. Anomaly detection in battery thermal images helps identify failures early, but traditional deep learning methods require extensive labeled data, which is difficult to obtain, especially for anomalies, due to safety risks and high data collection costs. To overcome this, we explore zero-shot anomaly detection using Visual Question Answering (VQA) models, which leverage pretrained knowledge and text-based prompts to generalize across vision tasks. By incorporating prior knowledge of normal battery thermal behavior, we design prompts to detect anomalies without battery-specific training data. We evaluate three VQA models (ChatGPT-4o, LLaVa-13b, and BLIP-2), analyzing their robustness to prompt variations, repeated trials, and qualitative outputs. Despite the lack of fine-tuning on battery data, our approach demonstrates competitive performance compared to state-of-the-art models that are trained on the battery data. Our findings highlight the potential of VQA-based zero-shot learning for battery anomaly detection and suggest future directions for improving its effectiveness.
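An illustrative zero-shot prompt of this kind is sketched below; the wording, the rough temperature threshold, and the `ask_vqa` backend are assumptions, not the paper's exact prompts or models.

```python
# Illustrative zero-shot prompt encoding prior knowledge of normal battery thermal behavior.
# The wording, the rough temperature threshold, and the `ask_vqa` backend are assumptions.
ANOMALY_PROMPT = (
    "You are inspecting a thermal image of a battery pack. "
    "Normal behavior: temperature is spatially uniform, below roughly 40 C, "
    "with at most a mild gradient near the terminals. "
    "Does this image show an anomaly such as a localized hotspot, an abnormal "
    "temperature gradient, or an overheated cell? "
    "Answer 'normal' or 'anomalous' and briefly justify."
)

def classify(image_path: str, ask_vqa) -> str:
    """`ask_vqa(image_path, prompt) -> str` can be any VQA backend (e.g., a multimodal chat model)."""
    answer = ask_vqa(image_path, ANOMALY_PROMPT)
    return "anomalous" if "anomalous" in answer.lower() else "normal"
```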
https://arxiv.org/abs/2505.16674
Embodied navigation demands comprehensive scene understanding and precise spatial reasoning. While image-text models excel at interpreting pixel-level color and lighting cues, 3D-text models capture volumetric structure and spatial relationships. However, unified fusion approaches that jointly fuse 2D images, 3D point clouds, and textual instructions face challenges from the limited availability of triple-modality data and the difficulty of resolving conflicting beliefs among modalities. In this work, we introduce CoNav, a collaborative cross-modal reasoning framework in which a pretrained 3D-text model explicitly guides an image-text navigation agent by providing structured spatial-semantic knowledge to resolve ambiguities during navigation. Specifically, we introduce Cross-Modal Belief Alignment, which operationalizes this cross-modal guidance by simply sharing textual hypotheses from the 3D-text model with the navigation agent. Through lightweight fine-tuning on a small 2D-3D-text corpus, the navigation agent learns to integrate visual cues with spatial-semantic knowledge derived from the 3D-text model, enabling effective reasoning in embodied navigation. CoNav achieves significant improvements on four standard embodied navigation benchmarks (R2R, CVDN, REVERIE, SOON) and two spatial reasoning benchmarks (ScanQA, SQA3D). Moreover, at comparable navigation Success Rates, CoNav often generates shorter paths than other methods (as measured by SPL), showcasing the potential and challenges of fusing data from different modalities in embodied navigation. Project Page: this https URL
https://arxiv.org/abs/2505.16663
We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, code, and datasets are released at this https URL.
https://arxiv.org/abs/2505.16661
Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
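A toy sketch of the general idea of giving attention a harmless place to go by extending the causal mask with register slots follows; the zero-initialized registers and shapes here are illustrative assumptions, not the FarSight implementation.

```python
# Toy sketch: append register slots to the keys/values and extend the causal mask so every
# query can attend to them, absorbing attention that would otherwise pile onto outlier tokens.
# The zero-initialized registers and shapes are assumptions for illustration only.
import torch
import torch.nn.functional as F

def attention_with_registers(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                             n_registers: int = 4) -> torch.Tensor:
    """q, k, v: (batch, seq, dim)."""
    b, s, d = k.shape
    reg_k = torch.zeros(b, n_registers, d, dtype=k.dtype, device=k.device)
    reg_v = torch.zeros(b, n_registers, d, dtype=v.dtype, device=v.device)
    k_ext = torch.cat([reg_k, k], dim=1)                   # (b, n_registers + s, d)
    v_ext = torch.cat([reg_v, v], dim=1)
    # Causal mask over real tokens; register columns are visible to every query position.
    causal = torch.tril(torch.ones(s, s, dtype=torch.bool, device=k.device))
    register_cols = torch.ones(s, n_registers, dtype=torch.bool, device=k.device)
    mask = torch.cat([register_cols, causal], dim=1)       # (s, n_registers + s)
    scores = q @ k_ext.transpose(-2, -1) / d ** 0.5        # (b, s, n_registers + s)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v_ext               # (b, s, d)
```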
https://arxiv.org/abs/2505.16652
We present a novel approach to Chest X-ray (CXR) Visual Question Answering (VQA), addressing both single-image and image-difference questions. Single-image questions focus on abnormalities within a specific CXR ("What abnormalities are seen in image X?"), while image-difference questions compare two longitudinal CXRs acquired at different time points ("What are the differences between image X and Y?"). We further explore how the integration of radiology reports can enhance the performance of VQA models. While previous approaches have demonstrated the utility of radiology reports during the pre-training phase, we extend this idea by showing that the reports can also be leveraged as additional input to improve the VQA model's predicted answers. First, we propose a unified method that handles both types of questions and auto-regressively generates the answers. For single-image questions, the model is provided with a single CXR. For image-difference questions, the model is provided with two CXRs from the same patient, captured at different time points, enabling the model to detect and describe temporal changes. Taking inspiration from 'Chain-of-Thought reasoning', we demonstrate that performance on the CXR VQA task can be improved by grounding the answer generator module with a radiology report predicted for the same CXR. In our approach, the VQA model is divided into two steps: i) Report Generation (RG) and ii) Answer Generation (AG). Our results demonstrate that incorporating predicted radiology reports as evidence for the AG model enhances performance on both single-image and image-difference questions, achieving state-of-the-art results on the Medical-Diff-VQA dataset.
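A compact sketch of the two-step report-then-answer pipeline is shown below; `report_model` and `answer_model` are assumed callables, not the authors' released models.

```python
# Compact sketch of the two-step pipeline: (i) Report Generation grounds the answer in a
# predicted radiology report, (ii) Answer Generation conditions on images, question, and report.
# `report_model` and `answer_model` are assumed callables, not the authors' released models.
from typing import Callable, List, Optional

def answer_cxr_question(question: str, image: str, prior_image: Optional[str],
                        report_model: Callable[[List[str]], str],
                        answer_model: Callable[[List[str], str], str]) -> str:
    images = [image] if prior_image is None else [prior_image, image]   # 1 or 2 CXRs
    report = report_model(images)                                        # step i: RG
    prompt = f"Report: {report}\nQuestion: {question}\nAnswer:"
    return answer_model(images, prompt)                                  # step ii: AG
```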
https://arxiv.org/abs/2505.16624
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. These questions enable efficient evaluation using the LLM-as-judge paradigm, testing both the LLMs' factual memory and self-awareness ("know what they don't know"). KoLasSimpleQA expands existing research in two key dimensions: (1) Breadth (Multilingual Coverage): It includes 9 languages, supporting global applicability evaluation. (2) Depth (Dual Domain Design): It covers both the general domain (global facts) and the language-specific domain (such as history, culture, and regional traditions) for a comprehensive assessment of multilingual capabilities. We evaluated mainstream LLMs, including traditional LLMs and emerging Large Reasoning Models. Results show significant performance differences between the two domains, particularly in performance metrics, ranking, calibration, and robustness. This highlights the need for targeted evaluation and optimization in multilingual contexts. We hope KoLasSimpleQA will help the research community better identify LLM capability boundaries in multilingual contexts and provide guidance for model optimization. We will release KoLasSimpleQA at this https URL.
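A minimal LLM-as-judge loop for single-answer factual questions, with an ABSTAIN outcome as a proxy for self-awareness, is sketched below; the judge prompt and `call_llm` helper are assumptions.

```python
# Minimal LLM-as-judge sketch for single-answer factual QA, with an ABSTAIN outcome as a proxy
# for "knowing what it doesn't know". The judge prompt and `call_llm` helper are assumptions.
from typing import Iterable, Tuple

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any judge model")

def judge(question: str, gold: str, prediction: str) -> str:
    verdict = call_llm(
        "Grade a factual QA answer.\n"
        f"Question: {question}\nGold answer: {gold}\nModel answer: {prediction}\n"
        "Reply with exactly one word: CORRECT, INCORRECT, or ABSTAIN "
        "(ABSTAIN if the model said it does not know)."
    )
    return verdict.strip().upper()

def score(records: Iterable[Tuple[str, str, str]]) -> dict:
    verdicts = [judge(q, gold, pred) for q, gold, pred in records]
    n = max(len(verdicts), 1)
    return {
        "accuracy": verdicts.count("CORRECT") / n,
        "abstain_rate": verdicts.count("ABSTAIN") / n,
    }
```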
https://arxiv.org/abs/2505.16591
Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain, up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-ended problems. Open-ended questions, characterized by the lack of a standard answer or by non-unique, diverse answers, remain underexplored. To bridge this gap, we present O$^2$-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O$^2$-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling external world knowledge from the model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O$^2$-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O$^2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$^2$-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly sized models, while performing on par with much larger ones.
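A sketch of a unified reward that branches on question type is given below; the key-point-coverage form for open-ended answers is an assumption, not the paper's exact reward functions.

```python
# Sketch of a unified reward that switches on question type: exact-match style reward for
# closed-ended questions, coverage of reference key points for open-ended ones.
# The key-point-coverage formulation is an assumption, not the paper's exact reward.
from typing import List, Optional

def reward(question_type: str, prediction: str, gold_answer: str = "",
           key_points: Optional[List[str]] = None) -> float:
    if question_type == "closed":
        # Closed-ended: exact-match style reward against the unique gold answer.
        return float(prediction.strip().lower() == gold_answer.strip().lower())
    # Open-ended: fraction of curated key points covered by the answer.
    key_points = key_points or []
    if not key_points:
        return 0.0
    hits = sum(1 for kp in key_points if kp.lower() in prediction.lower())
    return hits / len(key_points)
```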
https://arxiv.org/abs/2505.16582
Small large language models (sLLMs) offer the advantage of being lightweight and efficient, which makes them suitable for resource-constrained environments. However, sLLMs often struggle to maintain topic consistency in task-oriented dialogue systems, which is critical for scenarios such as service chatbots. Specifically, it is important to ensure that the model denies off-topic or malicious inputs and adheres to its intended functionality so as to prevent potential misuse and uphold reliability. Toward this end, activation engineering approaches have been proposed to manipulate internal activations during inference. While these methods are effective in certain scenarios, our preliminary experiments reveal their limitations in ensuring topic adherence. Therefore, to address this, we propose a novel approach termed Entropy-scaled Steering vectors for Topic Maintenance (EnSToM). EnSToM dynamically adjusts the steering intensity based on input uncertainty, which allows the model to handle off-topic distractors effectively while preserving on-topic accuracy. Our experiments demonstrate that EnSToM achieves a significant performance gain with a relatively small data size compared to fine-tuning approaches. By improving topic adherence without compromising efficiency, our approach provides a robust solution for enhancing sLLM-based dialogue systems.
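A toy sketch of entropy-scaled steering follows: a precomputed "stay on topic" steering vector is added to a hidden activation with strength proportional to next-token entropy; the scaling form and injection point are assumptions, not the EnSToM formulation.

```python
# Toy sketch of entropy-scaled steering: add a precomputed steering vector to a hidden-layer
# activation, scaled by the normalized entropy of the next-token distribution so that uncertain,
# likely off-topic inputs are steered harder. Scaling form and injection point are assumptions.
import math
import torch

def entropy_scaled_steering(hidden: torch.Tensor, logits: torch.Tensor,
                            steer_vec: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """hidden: (batch, dim) activation at the steered layer; logits: (batch, vocab)."""
    probs = torch.softmax(logits, dim=-1)
    ent = -(probs * torch.log(probs + 1e-9)).sum(dim=-1, keepdim=True)   # (batch, 1)
    ent = ent / math.log(logits.shape[-1])                                # normalize to [0, 1]
    return hidden + alpha * ent * steer_vec                               # stronger when uncertain
```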
https://arxiv.org/abs/2505.16526