Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation due to the presence of shortcut solutions based on superficial visual or textual cues. To mitigate this challenge in accurately assessing model performance, this paper introduces the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for assessing the physical understanding of video language models. The benchmark comprises 55K high-quality multiple-choice video QA examples focusing on physical world understanding. Examples are curated from nine video data sources, spanning first-person egocentric and exocentric videos, robotic interaction data, and cognitive science intuitive physics benchmarks. To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair -- a visually similar video accompanied by an identical question but an opposing answer. To answer a question correctly, a model must provide correct answers for both examples in the minimal-change pair; as such, models that rely solely on visual or textual biases would achieve below-random performance. Human performance on MVP is 92.9%, while the best open-source state-of-the-art video-language model achieves 40.2%, compared to random performance at 25%.
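As a concrete illustration of the paired scoring rule -- a question counts as correct only when both members of its minimal-change pair are answered correctly -- here is a minimal Python sketch. The data layout (per-example predictions, gold answers, and shared pair ids) is an assumption for illustration, not the benchmark's released evaluation code.

```python
from collections import defaultdict

def paired_accuracy(predictions, answers, pair_ids):
    """predictions/answers: dicts mapping example_id -> chosen option letter.
    pair_ids: dict mapping example_id -> an id shared by the two examples of a pair."""
    by_pair = defaultdict(list)
    for ex_id, pid in pair_ids.items():
        by_pair[pid].append(ex_id)

    # A pair counts as correct only if every member is answered correctly.
    correct = sum(
        all(predictions[m] == answers[m] for m in members)
        for members in by_pair.values()
    )
    return correct / len(by_pair)

# Toy usage: two pairs; only the first is answered consistently correctly.
preds = {"v1a": "B", "v1b": "C", "v2a": "A", "v2b": "A"}
golds = {"v1a": "B", "v1b": "C", "v2a": "A", "v2b": "D"}
pairs = {"v1a": "p1", "v1b": "p1", "v2a": "p2", "v2b": "p2"}
print(paired_accuracy(preds, golds, pairs))  # 0.5
```

Under this rule, a model that always picks the textually plausible option answers at most one member of each pair correctly, which is why biased models fall below random chance.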
https://arxiv.org/abs/2506.09987
Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: this https URL and this https URL
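To make the robustness track concrete, the sketch below applies a few generic perturbations of the kind described above (sensor noise, overexposure, specular highlights) to an image array. The specific augmentations and their parameters are illustrative assumptions, not the dataset's released augmentation pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma=10.0):
    """Simulate sensor noise."""
    noisy = img.astype(np.float32) + rng.normal(0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def overexpose(img, gain=1.4):
    """Simulate overexposure from the endoscope light source."""
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

def specular_spot(img, radius=20):
    """Paint a bright circular spot, mimicking a specular reflection."""
    out = img.copy()
    h, w = img.shape[:2]
    cy, cx = rng.integers(radius, h - radius), rng.integers(radius, w - radius)
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    out[mask] = 255
    return out

# Toy usage on a random stand-in "frame".
frame = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
augmented = specular_spot(overexpose(add_gaussian_noise(frame)))
```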
https://arxiv.org/abs/2506.09958
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprising 2,017 videos with 5,986 human-annotated dialogues consisting of 40,954 interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: this https URL.
https://arxiv.org/abs/2506.09953
Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QRRETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QRRETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QRRETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
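A toy sketch of the scoring idea follows: attention mass flowing from the query tokens to each candidate context chunk, accumulated over a selected set of heads, is used as the retrieval score. The tensor shapes, the random attention maps, and the head-selection heuristic are stand-ins for a real model's attention and the paper's head-identification procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in attention: attn[h, i, j] = attention from token i to token j for head h.
n_heads, n_tokens = 8, 120
attn = rng.random((n_heads, n_tokens, n_tokens))
attn /= attn.sum(-1, keepdims=True)          # row-normalize like softmax output

query_pos = np.arange(100, 120)              # the question occupies the last 20 tokens
chunks = [np.arange(s, s + 20) for s in range(0, 100, 20)]  # 5 context chunks

def chunk_scores(head_ids):
    """Accumulated query->chunk attention mass, summed over the selected heads."""
    mass = attn[head_ids][:, query_pos, :].sum(axis=(0, 1))  # total mass per context token
    return np.array([mass[c].sum() for c in chunks])

# Head selection (done offline on a few real examples in the paper): keep heads whose
# query->gold-chunk mass is largest. The "gold" chunk here is arbitrary for illustration.
gold = 2
per_head = np.array([chunk_scores([h])[gold] for h in range(n_heads)])
qr_heads = np.argsort(per_head)[-4:]

scores = chunk_scores(qr_heads)
ranking = np.argsort(scores)[::-1]           # retrieve chunks with the highest scores
print(ranking)
```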
https://arxiv.org/abs/2506.09944
We introduce CausalVQA, a benchmark dataset for video question answering (VQA) composed of question-answer pairs that probe models' understanding of causality in the physical world. Existing VQA benchmarks either tend to focus on surface perceptual understanding of real-world videos, or on narrow physical reasoning questions created using simulation environments. CausalVQA fills an important gap by presenting challenging questions that are grounded in real-world scenarios, while focusing on models' ability to predict the likely outcomes of different actions and events through five question types: counterfactual, hypothetical, anticipation, planning and descriptive. We designed quality control mechanisms that prevent models from exploiting trivial shortcuts, requiring models to base their answers on deep visual understanding instead of linguistic cues. We find that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions. This highlights a challenge for current systems to leverage spatial-temporal reasoning, understanding of physical principles, and comprehension of possible alternatives to make accurate predictions in real-world settings.
https://arxiv.org/abs/2506.09943
Developing 3D-VL generalists capable of understanding 3D scenes and following natural language instructions to perform a wide range of tasks has been a long-standing goal in the 3D-VL community. Despite recent progress, 3D-VL models still lag behind their 2D counterparts in capability and robustness, falling short of the generalist standard. A key obstacle to developing 3D-VL generalists lies in data scalability, hindered by the lack of an efficient scene representation. We propose LEO-VL, a 3D-VL model built upon condensed feature grid (CFG), an efficient scene representation that bridges 2D perception and 3D spatial structure while significantly reducing token overhead. This efficiency unlocks large-scale training towards a 3D-VL generalist, for which we curate over 700k high-quality 3D-VL samples spanning four domains of real-world indoor scenes and five tasks such as captioning and dialogue. LEO-VL achieves state-of-the-art performance on a variety of 3D QA benchmarks, including SQA3D, MSQA, and Beacon3D. Ablation studies confirm the efficiency of our representation, the importance of task and scene diversity, and the validity of our data curation principle. Furthermore, we introduce SceneDPO, a novel post-training objective that enhances the robustness of 3D-VL models. We hope our findings contribute to the advancement of scalable and robust 3D-VL generalists.
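The general idea of condensing 2D-derived features into a compact 3D representation can be sketched as follows: back-projected point features are pooled into a coarse voxel grid and only the occupied cells are kept as scene tokens, shrinking the token count. The grid size, pooling rule, and data shapes are illustrative assumptions, not the paper's CFG implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def condense_to_grid(xyz, feats, grid=8):
    """Pool per-point features into a coarse 3D grid and keep only occupied cells."""
    mins, maxs = xyz.min(0), xyz.max(0)
    idx = np.floor((xyz - mins) / (maxs - mins + 1e-6) * grid).astype(int)  # cell per point
    flat = idx[:, 0] * grid * grid + idx[:, 1] * grid + idx[:, 2]
    tokens = [feats[flat == cell].mean(0) for cell in np.unique(flat)]
    return np.stack(tokens)

# Toy scene: 50k back-projected points with 256-d image features -> a few hundred tokens.
xyz = rng.uniform(0, 5, size=(50_000, 3))
feats = rng.normal(size=(50_000, 256)).astype(np.float32)
scene_tokens = condense_to_grid(xyz, feats)
print(scene_tokens.shape)  # at most grid**3 tokens, far fewer than 50k
```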
https://arxiv.org/abs/2506.09935
The Segment Anything Model 2 (SAM2) has gained significant attention as a foundational approach for promptable image and video segmentation. However, its heavy computational and memory demands pose a severe challenge for its application in resource-constrained scenarios. In this paper, we propose an accurate low-bit quantization method for efficient SAM2, termed Q-SAM2. To address the performance degradation caused by the singularities in weight and activation distributions during quantization, Q-SAM2 introduces two novel technical contributions. We first introduce a linear layer calibration method for low-bit initialization of SAM2, which minimizes the Frobenius norm over a small image batch to reposition weight distributions for improved quantization. We then propose a Quantization-Aware Training (QAT) pipeline that applies clipping to suppress outliers and allows the network to adapt to quantization thresholds during training. Our comprehensive experiments demonstrate that Q-SAM2 allows for highly accurate inference while substantially improving efficiency. Both quantitative and visual results show that our Q-SAM2 surpasses existing state-of-the-art general quantization schemes, especially for ultra-low 2-bit quantization. While designed for quantization-aware training, our proposed calibration technique also proves effective in post-training quantization, achieving up to a 66% mIoU accuracy improvement over non-calibrated models.
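A minimal sketch of a Frobenius-norm calibration step is shown below: a clipping threshold for a low-bit linear layer is chosen on a small calibration batch so that the quantized layer's outputs stay close to the full-precision outputs. This only illustrates the criterion; the paper's calibration repositions weight distributions and is paired with a QAT pipeline, neither of which is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_clip(w, clip, n_bits=2):
    """Symmetric uniform quantization with a clipping threshold."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = clip / qmax
    return np.round(np.clip(w, -clip, clip) / scale) * scale

# Full-precision weights of one linear layer and a small calibration batch.
W = rng.normal(size=(64, 128))          # (out_features, in_features)
X = rng.normal(size=(32, 128))          # activations from a few calibration images

def output_frob_error(clip):
    """Frobenius norm between full-precision and quantized-layer outputs."""
    Wq = quantize_clip(W, clip)
    return np.linalg.norm(X @ W.T - X @ Wq.T)

# Pick the clipping threshold that best preserves the layer's outputs at 2 bits.
candidates = np.linspace(0.2, 1.0, 41) * np.abs(W).max()
best_clip = min(candidates, key=output_frob_error)
print(best_clip, output_frob_error(best_clip))
```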
https://arxiv.org/abs/2506.09782
It is important for Large Language Models to be aware of the boundary of their knowledge, i.e., to have a mechanism for identifying known and unknown queries. This type of awareness can help models perform adaptive inference, such as invoking RAG, engaging in slow and deep thinking, or adopting the abstention mechanism, which is beneficial to the development of efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which aims to determine whether the model is able to address a given query without generating any tokens. To this end, we introduce a novel and training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens. Empirical results on both factual QA and mathematical reasoning tasks demonstrate that our internal confidence can outperform several baselines. Furthermore, we showcase that our proposed method can be used for efficient RAG and model cascading, which is able to reduce inference costs while maintaining performance.
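The downstream routing use case (efficient RAG or model cascading) can be illustrated with a small sketch: a query-level confidence is aggregated from per-layer, per-token self-evaluation scores obtained from the query's forward pass alone, and a threshold decides whether to answer directly or fall back to retrieval. The random scores, aggregation rule, and threshold are placeholders, not the paper's Internal Confidence computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-layer, per-token self-evaluation scores computed from the query's
# forward pass (no tokens generated). Shape: (n_layers, n_query_tokens).
layer_token_scores = rng.random((32, 16))

def internal_confidence(scores, layer_frac=0.5):
    """Aggregate scores over the upper layers and all query tokens into one scalar."""
    start = int(scores.shape[0] * layer_frac)
    return float(scores[start:].mean())

def route(query_scores, threshold=0.55):
    """Query-level routing: answer directly when confident, otherwise invoke RAG."""
    conf = internal_confidence(query_scores)
    return "answer_directly" if conf >= threshold else "invoke_rag"

print(route(layer_token_scores))
```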
https://arxiv.org/abs/2506.09669
This paper presents a system developed for SemEval 2025 Task 8: Question Answering (QA) over tabular data. Our approach integrates several key components: text-to-SQL and text-to-code generation modules, a self-correction mechanism, and retrieval-augmented generation (RAG). Additionally, it includes an end-to-end (E2E) module, all orchestrated by a large language model (LLM). Through ablation studies, we analyzed the effects of different parts of our pipeline and identified the challenges that are still present in this field. During the evaluation phase of the competition, our solution achieved an accuracy of 80%, resulting in a top-13 ranking among the 38 participating teams. Our pipeline demonstrates a significant improvement in accuracy for open-source models and achieves performance comparable to proprietary LLMs in QA tasks over tables. The code is available in a GitHub repository.
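The self-correction mechanism can be sketched as a generate-execute-repair loop: a draft SQL query is executed, and any database error is fed back into the next generation attempt. The llm_generate_sql function below is a stub standing in for the actual LLM call, and the toy schema is invented for illustration.

```python
import sqlite3

def llm_generate_sql(question, schema, error=None):
    """Stub for the LLM call; a real system would prompt an LLM with the question,
    schema, and any previous error. This stub 'fixes' a wrong column on retry."""
    if error is None:
        return "SELECT SUM(prices) FROM sales"      # first draft, wrong column
    return "SELECT SUM(price) FROM sales"           # corrected after seeing the error

def answer(question, conn, schema, max_attempts=3):
    error = None
    for _ in range(max_attempts):
        sql = llm_generate_sql(question, schema, error)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)                        # feed the error back for self-correction
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("a", 2.0), ("b", 3.5)])
print(answer("What is the total revenue?", conn, "sales(item, price)"))  # [(5.5,)]
```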
https://arxiv.org/abs/2506.09657
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by outdated knowledge and hallucinations. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reason paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach that captures both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge and supports the downstream reasoner through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by 2.66%-20.34%, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Codes are available at: this https URL.
https://arxiv.org/abs/2506.09645
Automated 3D CT diagnosis empowers clinicians to make timely, evidence-based decisions by enhancing diagnostic accuracy and workflow efficiency. While multimodal large language models (MLLMs) exhibit promising performance in visual-language understanding, existing methods mainly focus on 2D medical images, which fundamentally limits their ability to capture complex 3D anatomical structures. This limitation often leads to misinterpretation of subtle pathologies and causes diagnostic hallucinations. In this paper, we present the Hybrid Spatial Encoding Network (HSENet), a framework that exploits enriched 3D medical visual cues via effective visual perception and projection for accurate and robust vision-language understanding. Specifically, HSENet employs dual-3D vision encoders to perceive both global volumetric contexts and fine-grained anatomical details, which are pre-trained by dual-stage alignment with diagnostic reports. Furthermore, we propose Spatial Packer, an efficient multimodal projector that condenses high-resolution 3D spatial regions into a compact set of informative visual tokens via centroid-based compression. By pairing spatial packers with the dual-3D vision encoders, HSENet can seamlessly perceive and transfer hybrid visual representations to the LLM's semantic space, facilitating accurate diagnostic text generation. Experimental results demonstrate that our method achieves state-of-the-art performance in 3D language-visual retrieval (R@100 of 39.85%, +5.96% gain), 3D medical report generation (BLEU-4 of 24.01%, +8.01% gain), and 3D visual question answering (Major Class Accuracy of 73.60%, +1.99% gain), confirming its effectiveness. Our code is available at this https URL.
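Centroid-based compression of visual tokens can be sketched as k-means-style pooling: a large set of high-resolution patch tokens is grouped around a small number of centroids, and each centroid becomes one compact visual token. The shapes, the plain k-means loop, and the random features are illustrative assumptions rather than the Spatial Packer's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def centroid_pack(tokens, n_out=32, n_iter=10):
    """Compress N visual tokens (N, D) into n_out tokens by k-means-style
    centroid pooling: each output token is the mean of its assigned inputs."""
    centroids = tokens[rng.choice(len(tokens), n_out, replace=False)]
    for _ in range(n_iter):
        dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for k in range(n_out):
            members = tokens[assign == k]
            if len(members):
                centroids[k] = members.mean(0)
    return centroids

# Toy usage: 2,048 high-resolution 3D patch tokens packed into 32 visual tokens.
patch_tokens = rng.normal(size=(2048, 256)).astype(np.float32)
packed = centroid_pack(patch_tokens)
print(packed.shape)  # (32, 256)
```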
https://arxiv.org/abs/2506.09634
Retrieval-Augmented Generation (RAG) improves factual accuracy by grounding responses in external knowledge. However, existing methods typically rely on a single source, either unstructured text or structured knowledge. Moreover, they lack cognitively inspired mechanisms for activating relevant knowledge. To address these issues, we propose KG-Infused RAG, a framework that integrates KGs into RAG systems to implement spreading activation, a cognitive process that enables concept association and inference. KG-Infused RAG retrieves KG facts, expands the query accordingly, and enhances generation by combining corpus passages with structured facts, enabling interpretable, multi-source retrieval grounded in semantic structure. We further improve KG-Infused RAG via preference learning on sampled key stages in the pipeline. Experiments on five QA benchmarks show that KG-Infused RAG consistently outperforms vanilla RAG (by 3.8% to 13.8%). Additionally, when integrated into Self-RAG, KG-Infused RAG brings further performance gains, demonstrating its effectiveness and versatility as a plug-and-play enhancement module for corpus-based RAG methods.
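Spreading activation over a KG can be sketched as decayed propagation from the query's seed concepts, with the most activated neighbors used to expand the query before corpus retrieval. The toy triples, decay factor, and expansion rule are illustrative assumptions, not the paper's pipeline.

```python
from collections import defaultdict

# Toy KG as relational triples.
triples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "is_a", "nsaid"),
    ("nsaid", "may_cause", "stomach_ulcer"),
    ("stomach_ulcer", "located_in", "stomach"),
]
neighbors = defaultdict(list)
for h, r, t in triples:
    neighbors[h].append((r, t))
    neighbors[t].append((r, h))     # treat the graph as undirected for activation

def spread_activation(seeds, hops=2, decay=0.5):
    """Activate seed concepts, then propagate decayed activation to neighbors."""
    act = {s: 1.0 for s in seeds}
    frontier = dict(act)
    for _ in range(hops):
        nxt = {}
        for node, a in frontier.items():
            for _, nb in neighbors[node]:
                nxt[nb] = max(nxt.get(nb, 0.0), a * decay)
        for nb, a in nxt.items():
            act[nb] = max(act.get(nb, 0.0), a)
        frontier = nxt
    return act

# Expand the query with the most activated concepts beyond the seeds.
seeds = ["aspirin"]
activation = spread_activation(seeds)
expansion = sorted((c for c in activation if c not in seeds),
                   key=activation.get, reverse=True)[:3]
query = "What are the side effects of aspirin? " + " ".join(expansion)
print(query)
```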
https://arxiv.org/abs/2506.09542
Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, where we design an Error Refiner to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
https://arxiv.org/abs/2506.09513
In-context learning (ICL), a predominant trend in instruction learning, aims to enhance the performance of large language models by providing clear task guidance and examples, improving their capability in task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores policies for multi-modal demonstration selection. Existing research efforts in ICL face significant challenges: first, they rely on pre-defined demonstrations or heuristic selection strategies based on human intuition, which are usually inadequate for covering diverse task requirements, leading to sub-optimal solutions; second, individually selecting each demonstration fails to model the interactions between demonstrations, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which explores policies to fuse multi-modal information and adaptively select adequate demonstrations as an integrated whole. The framework allows LVLMs to optimize themselves by continually refining their demonstrations through self-exploration, enabling them to autonomously identify and generate the most effective selection policies for in-context learning. Experimental results verify the superior performance of our approach on four Visual Question-Answering (VQA) datasets, demonstrating its effectiveness in enhancing the generalization capability of few-shot LVLMs.
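As a toy illustration of the exploration-exploitation idea -- scoring candidate demonstration sets as a whole rather than picking each example independently -- the sketch below runs an epsilon-greedy search over demonstration subsets, with a stubbed reward in place of the LVLM's held-out accuracy. The stub, episode count, and selection scheme are far simpler than the paper's reinforcement learning framework.

```python
import random

random.seed(0)

candidate_demos = [f"demo_{i}" for i in range(20)]

def reward(demo_set):
    """Stub: the real reward would be the few-shot LVLM's accuracy on held-out VQA
    examples when prompted with this demonstration set. This stub prefers low indices."""
    return sum(1.0 / (1 + int(d.split("_")[1])) for d in demo_set) + random.gauss(0, 0.05)

def select_demos(k=4, episodes=200, eps=0.2):
    """Epsilon-greedy search over demonstration sets treated as an integrated whole."""
    best_set = random.sample(candidate_demos, k)
    best_r = reward(best_set)
    for _ in range(episodes):
        if random.random() < eps:
            cand = random.sample(candidate_demos, k)        # explore a fresh combination
        else:
            cand = best_set.copy()                          # exploit: swap one demonstration
            cand[random.randrange(k)] = random.choice(
                [d for d in candidate_demos if d not in cand])
        r = reward(cand)
        if r > best_r:
            best_set, best_r = cand, r
    return best_set

print(select_demos())
```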
https://arxiv.org/abs/2506.09473
We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with the start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and the temporal grounding. We operate in a weakly supervised setup where the temporal grounding annotations are not available. We generate pseudo labels for temporal grounding and ensure the validity of these labels by imposing a consistency constraint between a grounded response to a question and the response generated for a question that refers to the same temporal segment. We notice that jointly generating the answers with the grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark which is designed to evaluate weakly supervised grounded question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance for both tasks on these benchmarks.
https://arxiv.org/abs/2506.09445
Knowledge Graph Question Answering (KGQA) is a crucial task in natural language processing that requires reasoning over knowledge graphs (KGs) to answer natural language questions. Recent methods utilizing large language models (LLMs) have shown remarkable semantic parsing capabilities but are limited by the scarcity of diverse annotated data and multi-hop reasoning samples. Traditional data augmentation approaches focus mainly on single-hop questions and are prone to semantic distortion, while LLM-based methods primarily address semantic distortion but usually neglect multi-hop reasoning, thus limiting data diversity. The scarcity of multi-hop samples further weakens models' generalization. To address these issues, we propose PGDA-KGQA, a prompt-guided generative framework with multiple data augmentation strategies for KGQA. At its core, PGDA-KGQA employs a unified prompt-design paradigm: by crafting meticulously engineered prompts that integrate the provided textual content, it leverages LLMs to generate large-scale (question, logical form) pairs for model training. Specifically, PGDA-KGQA enriches its training set by: (1) generating single-hop pseudo questions to improve the alignment of question semantics with KG relations; (2) applying semantic-preserving question rewriting to improve robustness against linguistic variations; and (3) employing answer-guided reverse path exploration to create realistic multi-hop questions. By adopting an augment-generate-retrieve semantic parsing pipeline, PGDA-KGQA utilizes the augmented data to enhance the accuracy of logical form generation and thus improve answer retrieval performance. Experiments demonstrate that PGDA-KGQA outperforms state-of-the-art methods on standard KGQA datasets, achieving improvements on WebQSP of 2.8%, 1.2%, and 3.1% and on ComplexWebQuestions of 1.8%, 1.1%, and 2.4% in F1, Hits@1, and Accuracy, respectively.
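Answer-guided reverse path exploration can be illustrated on a toy KG: starting from an answer entity, the sketch walks backwards along incoming edges to collect a multi-hop relation path and then templates a question from it (in the paper, an LLM would verbalize the path instead). The triples and the question template are invented for illustration.

```python
import random

random.seed(0)

# Toy KG: head -[relation]-> tail.
triples = [
    ("Barack_Obama", "born_in", "Honolulu"),
    ("Honolulu", "located_in", "Hawaii"),
    ("Hawaii", "part_of", "United_States"),
]
incoming = {}
for h, r, t in triples:
    incoming.setdefault(t, []).append((h, r))

def reverse_path(answer, hops=2):
    """Walk backwards from the answer entity to collect a multi-hop relation path."""
    path, node = [], answer
    for _ in range(hops):
        if node not in incoming:
            break
        h, r = random.choice(incoming[node])
        path.append((h, r, node))
        node = h
    return list(reversed(path)), node

path, topic = reverse_path("Hawaii")
# Template a 2-hop question from the collected path.
relations = " and then ".join(r for _, r, _ in path)
question = f"Starting from {topic}, what entity do we reach via {relations}?"
print(question, "->", "Hawaii")
```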
https://arxiv.org/abs/2506.09414
Large Language Models (LLMs) have demonstrated impressive performance on multiple-choice question answering (MCQA) benchmarks, yet they remain highly vulnerable to minor input perturbations. In this paper, we introduce and evaluate Token Constraint Decoding (TCD). This simple yet effective inference-time algorithm enforces alignment between token-level predictions to enhance robustness in noisy settings. Through extensive experiments on CommonsenseQA, MMLU, and MMLU-Pro, we show that TCD, especially when paired with prompt engineering (PE) fixes, significantly restores performance degraded by input noise, yielding up to +39% absolute gains for weaker models like Gemma3 1B. Penalty sweep analyses further reveal that TCD implicitly regularizes overconfident outputs, with different models requiring distinct penalty schedules to maximize resilience. Our findings establish TCD as a practical, model-agnostic approach for improving reasoning stability under real-world imperfections and pave the way for more reliable deployment of LLMs in safety-critical or user-facing applications.
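The core decoding constraint can be sketched in a few lines: logits of tokens outside the valid answer options are penalized before the argmax, so noisy prompts cannot push the model toward irrelevant tokens. The toy vocabulary, the flat penalty, and the single-step decode are simplifications; as noted above, the paper sweeps penalty schedules per model.

```python
import numpy as np

def token_constraint_decode(logits, vocab, options=("A", "B", "C", "D"), penalty=10.0):
    """Subtract a penalty from every token that is not a valid answer option,
    then pick the highest-scoring remaining token."""
    constrained = logits.copy()
    allowed = {vocab[o] for o in options}
    for tok_id in range(len(constrained)):
        if tok_id not in allowed:
            constrained[tok_id] -= penalty
    return int(np.argmax(constrained)), constrained

# Toy vocabulary and a noisy logit vector where an irrelevant token scores highest.
vocab = {"A": 0, "B": 1, "C": 2, "D": 3, "the": 4, "Answer": 5}
logits = np.array([1.2, 2.5, 0.3, 0.9, 3.1, 2.8])
pred_id, _ = token_constraint_decode(logits, vocab)
print(pred_id)  # 1 -> option "B"
```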
https://arxiv.org/abs/2506.09408
Large language models (LLMs) exhibit remarkable capabilities across diverse tasks, yet aligning them efficiently and effectively with human expectations remains a critical challenge. This thesis advances LLM alignment by introducing novel methodologies in data collection, training, and evaluation. We first address alignment data collection. Existing approaches rely heavily on manually curated datasets or proprietary models. To overcome these limitations, we propose Lion, an adversarial distillation framework that iteratively refines training data by identifying and generating challenging instructions, enabling state-of-the-art zero-shot reasoning. Additionally, we introduce Web Reconstruction (WebR), a fully automated framework that synthesizes instruction-tuning data directly from raw web documents, significantly improving data diversity and scalability over existing synthetic data methods. Next, we enhance alignment training through novel optimization techniques. We develop Learning to Edit (LTE), a framework that enables LLMs to efficiently integrate new knowledge while preserving existing information. LTE leverages meta-learning to improve both real-time and batch knowledge updates. Furthermore, we introduce Bridging and Modeling Correlations (BMC), a refinement of Direct Preference Optimization (DPO) that explicitly captures token-level correlations in preference data, leading to superior alignment across QA and mathematical reasoning tasks. Finally, we tackle the challenge of evaluating alignment. Existing benchmarks emphasize response quality but overlook adherence to specific constraints. To bridge this gap, we introduce FollowBench, a multi-level, fine-grained benchmark assessing LLMs' ability to follow complex constraints across diverse instruction types. Our results expose key weaknesses in current models' constraint adherence, offering insights for future improvements.
https://arxiv.org/abs/2506.09329
Recent innovations in multimodal action models represent a promising direction for developing general-purpose agentic systems, combining visual understanding, language comprehension, and action generation. We introduce MultiNet - a novel, fully open-source benchmark and surrounding software ecosystem designed to rigorously evaluate and adapt models across vision, language, and action domains. We establish standardized evaluation protocols for assessing vision-language models (VLMs) and vision-language-action models (VLAs), and provide open source software to download relevant data, models, and evaluations. Additionally, we provide a composite dataset with over 1.3 trillion tokens of image captioning, visual question answering, commonsense reasoning, robotic control, digital game-play, simulated locomotion/manipulation, and many more tasks. The MultiNet benchmark, framework, toolkit, and evaluation harness have been used in downstream research on the limitations of VLA generalization.
https://arxiv.org/abs/2506.09172
As AI chatbots become increasingly integrated in education, students are turning to these systems for guidance, feedback, and information. However, the anthropomorphic characteristics of these chatbots create ambiguity regarding whether students develop trust toward them as they would a human peer or instructor, based in interpersonal trust, or as they would any other piece of technology, based in technology trust. This ambiguity presents theoretical challenges, as interpersonal trust models may inappropriately ascribe human intentionality and morality to AI, while technology trust models were developed for non-social technologies, leaving their applicability to anthropomorphic systems unclear. To address this gap, we investigate how human-like and system-like trusting beliefs comparatively influence students' perceived enjoyment, trusting intention, behavioral intention to use, and perceived usefulness of an AI chatbot - factors associated with students' engagement and learning outcomes. Through partial least squares structural equation modeling, we found that human-like and system-like trust significantly influenced student perceptions, with varied effects. Human-like trust more strongly predicted trusting intention, while system-like trust better predicted behavioral intention and perceived usefulness. Both had similar effects on perceived enjoyment. Given the partial explanatory power of each type of trust, we propose that students develop a distinct form of trust with AI chatbots (human-AI trust) that differs from human-human and human-technology models of trust. Our findings highlight the need for new theoretical frameworks specific to human-AI trust and offer practical insights for fostering appropriately calibrated trust, which is critical for the effective adoption and pedagogical impact of AI in education.
https://arxiv.org/abs/2506.09160