Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness can be summarized in the following key features: (1) Coarse-to-Fine Tasks: We systematically deconstruct image generation into four task categories, No-ref/Ref Image Creating/Editing, based on the presence or absence of source and reference images, and further decompose them into 31 fine-grained tasks covering a broad spectrum of image generation requirements, culminating in a comprehensive benchmark. (2) Multi-dimensional Metrics: The evaluation framework assesses image generation capabilities across 6 dimensions: aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability. 11 metrics are introduced to support the multi-dimensional evaluation. Notably, we introduce VLLM-QA, an innovative metric that leverages large models to assess the success of image editing. (3) Hybrid Data: The data comes from both real scenes and virtual generation, which effectively improves data diversity and alleviates bias in model evaluation. Through ICE-Bench, we conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements. To foster further advancements in the field, we will open-source ICE-Bench, including its dataset, evaluation code, and models, providing a valuable resource for the research community.
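The four coarse categories follow directly from whether a source image and a reference image are given; a minimal sketch of that mapping (category names taken from the abstract, the helper itself is purely illustrative and not part of the released benchmark code):

```python
# Hypothetical helper illustrating the 2x2 coarse task taxonomy described above.
# Category names mirror the abstract; everything else is illustrative.

def coarse_category(has_source: bool, has_reference: bool) -> str:
    """Map presence of source/reference images to one of the four task categories."""
    if has_source and has_reference:
        return "Ref Image Editing"      # edit a source image, guided by a reference image
    if has_source:
        return "No-ref Image Editing"   # edit a source image from the prompt alone
    if has_reference:
        return "Ref Image Creating"     # create a new image, guided by a reference image
    return "No-ref Image Creating"      # pure text-to-image creation

if __name__ == "__main__":
    for src in (False, True):
        for ref in (False, True):
            print(f"source={src}, reference={ref} -> {coarse_category(src, ref)}")
```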
https://arxiv.org/abs/2503.14482
Graph Retrieval-Augmented Generation (GraphRAG) has proven highly effective in enhancing the performance of Large Language Models (LLMs) on tasks that require external knowledge. By leveraging Knowledge Graphs (KGs), GraphRAG improves information retrieval for complex reasoning tasks, providing more precise and comprehensive retrieval and generating more accurate answers to questions. However, most RAG methods fall short in addressing multi-step reasoning, particularly when both information extraction and inference are necessary. To address this limitation, this paper presents Knowledge Graph-Based Iterative Retrieval-Augmented Generation (KG-IRAG), a novel framework that integrates KGs with iterative reasoning to improve LLMs' ability to handle queries involving temporal and logical dependencies. Through iterative retrieval steps, KG-IRAG incrementally gathers relevant data from external KGs, enabling step-by-step reasoning. The proposed approach is particularly suited to scenarios where reasoning is required alongside dynamic temporal data extraction, such as determining optimal travel times based on weather conditions or traffic patterns. Experimental results show that KG-IRAG improves accuracy in complex reasoning tasks by effectively integrating external knowledge with iterative, logic-based retrieval. Additionally, three new datasets, weatherQA-Irish, weatherQA-Sydney, and trafficQA-TFNSW, are constructed to evaluate KG-IRAG's performance, demonstrating its potential beyond traditional RAG applications.
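A rough sketch of the iterative retrieval loop described above: retrieval and assessment alternate until the gathered KG evidence is judged sufficient. `kg_retrieve`, `llm_assess`, and `llm_answer` are hypothetical stand-ins for a graph query and two LLM calls; the paper's actual prompts and stopping criterion are not reproduced here.

```python
# Minimal sketch of an iterative KG retrieval loop in the spirit of KG-IRAG.
# All callables are hypothetical placeholders, not the paper's implementation.

def kg_iterative_rag(question: str, kg_retrieve, llm_assess, llm_answer, max_steps: int = 5) -> str:
    evidence = []                      # KG facts accumulated across iterations
    query = question                   # first retrieval targets the question itself
    for _ in range(max_steps):
        evidence.extend(kg_retrieve(query))
        verdict = llm_assess(question, evidence)    # e.g. {"sufficient": bool, "next_query": str}
        if verdict["sufficient"]:
            break
        query = verdict["next_query"]  # follow-up retrieval targets the missing temporal/logical link
    return llm_answer(question, evidence)
```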
https://arxiv.org/abs/2503.14234
Real dialogues with AI assistants for solving data-centric tasks often follow dynamic, unpredictable paths due to imperfect information provided by the user or in the data, which must be caught and handled. Developing datasets that capture such user-AI interactions is difficult and time-consuming. In this work, we develop a novel framework for synthetically generating controlled, multi-turn conversations between a user and an AI assistant for the task of table-based question answering; conversations can be generated from any existing dataset with fully specified table QA examples for a target domain. Each conversation aims to solve a table-based reasoning question through collaborative effort, modeling one of two real-world scenarios: (1) an AI-initiated clarification, or (2) a user-initiated correction. Critically, we employ a strong teacher LLM to verify the correctness of our synthetic conversations, ensuring high quality. We use synthetic datasets generated from TAT-QA and WikiTableQuestions as benchmarks for frontier LLMs. We find that even larger models struggle to effectively issue clarification questions and to accurately integrate user feedback for corrections.
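A minimal sketch of the generate-then-verify loop implied above, assuming hypothetical `generate_conversation` and `teacher_verify` LLM calls; only conversations the teacher accepts are kept.

```python
# Illustrative sketch of synthesizing one multi-turn conversation from a fully
# specified table-QA example. Scenario labels come from the abstract; the LLM
# calls are placeholders rather than the paper's prompts.
import random

def synthesize(table_qa_example: dict, generate_conversation, teacher_verify, max_tries: int = 3):
    scenario = random.choice(["ai_initiated_clarification", "user_initiated_correction"])
    for _ in range(max_tries):
        convo = generate_conversation(table_qa_example, scenario)   # multi-turn user/AI dialogue
        if teacher_verify(convo, table_qa_example["answer"]):       # strong teacher LLM checks correctness
            return {"scenario": scenario, "conversation": convo}
    return None                                                     # discard examples the teacher rejects
```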
https://arxiv.org/abs/2503.14167
Industrial Anomaly Detection (IAD) is critical to ensure product quality during manufacturing. Although existing zero-shot defect segmentation and detection methods have shown effectiveness, they cannot provide detailed descriptions of the defects. Furthermore, the application of large multi-modal models in IAD remains in its infancy, facing challenges in balancing question-answering (QA) performance and mask-based grounding capabilities, often owing to overfitting during the fine-tuning process. To address these challenges, we propose a novel approach that introduces a dedicated multi-modal defect localization module to decouple the dialog functionality from the core feature extraction. This decoupling is achieved through independent optimization objectives and tailored learning strategies. Additionally, we contribute the first multi-modal industrial anomaly detection training dataset, named Defect Detection Question Answering (DDQA), encompassing a wide range of defect types and industrial scenarios. Unlike conventional datasets that rely on GPT-generated data, DDQA ensures authenticity and reliability and offers a robust foundation for model training. Experimental results demonstrate that our proposed method, Explainable Industrial Anomaly Detection Assistant (EIAD), achieves outstanding performance in defect detection and localization tasks. It not only significantly enhances accuracy but also improves interpretability. These advancements highlight the potential of EIAD for practical applications in industrial settings.
https://arxiv.org/abs/2503.14162
Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modalities in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset with 6M data (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that our Marten consistently achieves significant improvements among 8B-MLLMs in document-centric tasks. Code and datasets are available at this https URL.
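A minimal PyTorch-style sketch of optimizing the two objectives jointly, assuming a backbone that returns answer logits and visual features; the module names, interface, and loss weighting are assumptions rather than the released Marten code.

```python
# Sketch: joint VQA-based text parsing loss + auxiliary mask-generation loss.
# The mask generator is an extra head used only during training.
import torch.nn as nn

class VQAMaskWrapper(nn.Module):
    def __init__(self, mllm: nn.Module, mask_generator: nn.Module, mask_weight: float = 1.0):
        super().__init__()
        self.mllm = mllm                      # backbone returning (answer logits, visual features)
        self.mask_generator = mask_generator  # auxiliary head, dropped at inference time
        self.mask_weight = mask_weight

    def forward(self, images, questions, answer_ids, gt_masks):
        answer_logits, visual_feats = self.mllm(images, questions)
        vqa_loss = nn.functional.cross_entropy(
            answer_logits.flatten(0, 1), answer_ids.flatten())          # implicit semantic alignment
        pred_masks = self.mask_generator(visual_feats)
        mask_loss = nn.functional.binary_cross_entropy_with_logits(
            pred_masks, gt_masks)                                        # explicit spatial alignment
        return vqa_loss + self.mask_weight * mask_loss
```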
https://arxiv.org/abs/2503.14140
Large language models have demonstrated excellent domain-specific question-answering capabilities when fine-tuned on a dataset from that domain. However, fine-tuning the models requires significant training time and considerable hardware resources. In this work, we propose CARE (Customer Assistance and Response Engine), a lightweight model built by fine-tuning Phi3.5-mini on very minimal hardware and data, designed to handle queries primarily across three domains: telecommunications support, medical support, and banking support. For telecommunications and banking, the chatbot addresses issues and problems that customers regularly face in these domains. In the medical domain, CARE provides preliminary support by offering basic diagnoses and medical suggestions that a user might consider before consulting a healthcare professional. Since CARE is built on Phi3.5-mini, it can be used even on mobile devices, increasing its usability. Our research also shows that CARE performs relatively well on various medical benchmarks, indicating that it can be used to make basic medical suggestions.
https://arxiv.org/abs/2503.14136
Traffic scene understanding is essential for intelligent transportation systems and autonomous driving, ensuring safe and efficient vehicle operation. While recent advancements in VLMs have shown promise for holistic scene understanding, the application of VLMs to traffic scenarios, particularly using BEV maps, remains underexplored. Existing methods often suffer from limited task design and narrow data coverage, hindering comprehensive scene understanding. To address these challenges, we introduce ChatBEV-QA, a novel BEV VQA benchmark containing over 137k questions, designed to encompass a wide range of scene understanding tasks, including global scene understanding, vehicle-lane interactions, and vehicle-vehicle interactions. This benchmark is constructed using a novel data collection pipeline that generates scalable and informative VQA data for BEV maps. We further fine-tune a specialized vision-language model, ChatBEV, enabling it to interpret diverse question prompts and extract relevant, context-aware information from BEV maps. Additionally, we propose a language-driven traffic scene generation pipeline, in which ChatBEV facilitates map understanding and text-aligned navigation guidance, significantly enhancing the generation of realistic and consistent traffic scenarios. The dataset, code, and fine-tuned model will be released.
https://arxiv.org/abs/2503.13938
Large language models (LLMs) have been one of the most important discoveries in machine learning in recent years. LLM-based artificial intelligence (AI) assistants, such as ChatGPT, have consistently attracted attention from researchers, investors, and the general public, driving the rapid growth of this industry. With the frequent introduction of new LLMs to the market, it becomes increasingly difficult to differentiate between them, creating a demand for new LLM comparison methods. In this research, the Consistency-focused Similarity Comparison Framework (ConSCompF) for generative large language models is proposed. It compares texts generated by two LLMs and produces a similarity score, indicating the overall degree of similarity between their responses. The main advantage of this framework is that it can operate on a small amount of unlabeled data, such as chatbot instruction prompts, and does not require LLM developers to disclose any information about their product. To evaluate the efficacy of ConSCompF, two experiments aimed at identifying similarities between multiple LLMs are conducted. Additionally, these experiments examine the correlation between the similarity scores generated by ConSCompF and the differences in the outputs produced by other benchmarking techniques, such as ROUGE-L. Finally, a series of few-shot LLM comparison experiments is conducted to evaluate the performance of ConSCompF in a few-shot LLM comparison scenario. The proposed framework can be used for calculating similarity matrices of multiple LLMs, which can be effectively visualized using principal component analysis (PCA). The ConSCompF output may provide useful insights into data that might have been used during LLM training and help detect possible investment fraud attempts.
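As an illustration of the overall workflow (pairwise similarity matrix plus PCA projection), the sketch below uses a simple TF-IDF cosine score as a stand-in for the ConSCompF similarity; it is not the paper's metric.

```python
# Sketch: build a pairwise similarity matrix over several LLMs from responses to
# a shared set of unlabeled prompts, then project it to 2D with PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity(responses_a: list, responses_b: list) -> float:
    """Placeholder pairwise score: mean per-prompt TF-IDF cosine similarity."""
    tfidf = TfidfVectorizer().fit(responses_a + responses_b)
    a, b = tfidf.transform(responses_a), tfidf.transform(responses_b)
    return float(np.mean(cosine_similarity(a, b).diagonal()))

def similarity_matrix(outputs: dict) -> tuple:
    """outputs: {model_name: [response per prompt]} with identical prompt order."""
    names = list(outputs)
    mat = np.array([[similarity(outputs[i], outputs[j]) for j in names] for i in names])
    return names, mat

def project_2d(mat: np.ndarray) -> np.ndarray:
    return PCA(n_components=2).fit_transform(mat)   # one 2D point per LLM for plotting
```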
https://arxiv.org/abs/2503.13923
Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architecture (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between generated answers and the input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at this https URL.
https://arxiv.org/abs/2503.13891
Large Language Models (LLMs) have made significant progress in various fields. However, challenges remain in Multi-Disciplinary Team (MDT) medical consultations. Current research enhances reasoning through role assignment, task decomposition, and the accumulation of medical experience, but multi-role collaboration in MDT consultations often results in excessively long dialogue histories, which increases the model's cognitive burden and degrades both efficiency and accuracy. Some methods only store treatment histories without extracting effective experience or reflecting on errors, which limits knowledge generalization and system evolution. We propose a multi-agent MDT medical consultation framework based on LLMs to address these issues. Our framework uses consensus aggregation and a residual discussion structure for multi-round consultations. It also employs a Correct Answer Knowledge Base (CorrectKB) and a Chain-of-Thought Knowledge Base (ChainKB) to accumulate consultation experience. These mechanisms enable the framework to evolve and continually improve diagnosis rationality and accuracy. Experimental results on the MedQA and PubMedQA datasets demonstrate that our framework achieves accuracies of 90.1% and 83.9%, respectively, and that the constructed knowledge bases generalize effectively across test sets from both datasets.
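A hypothetical sketch of consensus aggregation with a follow-up discussion round when agreement is not reached; the agent interface and voting rule are placeholders, and the paper's residual discussion structure and knowledge-base updates are only loosely approximated here.

```python
# Sketch: multi-round consultation that stops once enough agents agree, otherwise
# carries prior opinions into the next round. Not the paper's implementation.
from collections import Counter

def mdt_consult(case: str, agents: list, max_rounds: int = 3, threshold: float = 0.6) -> str:
    context = case
    for _ in range(max_rounds):
        answers = [agent.diagnose(context) for agent in agents]        # each role answers independently
        top, votes = Counter(answers).most_common(1)[0]
        if votes / len(agents) >= threshold:
            return top                                                 # consensus reached
        context = case + "\nPrior opinions:\n" + "\n".join(answers)    # carry disagreements into the next round
    return Counter(answers).most_common(1)[0][0]                       # fall back to majority after the final round
```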
https://arxiv.org/abs/2503.13856
In recent years, large language models (LLMs) have revolutionized the field of natural language processing. However, they often suffer from knowledge gaps and hallucinations. Graph retrieval-augmented generation (GraphRAG) enhances LLM reasoning by integrating structured knowledge from external graphs. However, we identify two key challenges that plague GraphRAG: (1) retrieving noisy and irrelevant information can degrade performance, and (2) excessive reliance on external knowledge suppresses the model's intrinsic reasoning. To address these issues, we propose GraphRAG-FI (Filtering and Integration), consisting of GraphRAG-Filtering and GraphRAG-Integration. GraphRAG-Filtering employs a two-stage filtering mechanism to refine retrieved information. GraphRAG-Integration employs a logits-based selection strategy to balance external knowledge from GraphRAG with the LLM's intrinsic reasoning, reducing over-reliance on retrievals. Experiments on knowledge graph QA tasks demonstrate that GraphRAG-FI significantly improves reasoning performance across multiple backbone models, establishing a more reliable and effective GraphRAG framework.
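A minimal sketch of a logits-based selection between graph-augmented and retrieval-free predictions, in the spirit of GraphRAG-Integration; the confidence-threshold rule below is an assumption for illustration, not the paper's exact strategy.

```python
# Sketch: keep the retrieval-free answer when the model is already confident
# without the graph, otherwise defer to the graph-augmented prediction.
import torch

def integrate(logits_with_kg: torch.Tensor, logits_without_kg: torch.Tensor,
              confidence_threshold: float = 0.8) -> torch.Tensor:
    probs_plain = torch.softmax(logits_without_kg, dim=-1)
    if probs_plain.max() >= confidence_threshold:
        return logits_without_kg   # intrinsic reasoning suffices; avoid over-reliance on retrieval
    return logits_with_kg          # otherwise use the retrieval-augmented prediction
```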
https://arxiv.org/abs/2503.13804
The evolution of Large Vision-Language Models (LVLMs) has progressed from single to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by the alteration of image positions. To further explore this issue, we introduce Position-wise Question Answering (PQA), a meticulously designed task to quantify reasoning capabilities at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel in reasoning with images positioned later but underperform with those in the middle or at the beginning, while proprietary models show improved comprehension for images at the beginning and end but struggle with those in the middle. Motivated by this, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias by employing linear interpolation between inter-image causal attention and bidirectional counterparts. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.
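The interpolation idea can be sketched as a blend of a causal and a bidirectional attention mask over image-token positions; the blending coefficient and the way the soft mask is applied are assumptions for illustration, not the released SoFA code.

```python
# Sketch: soft attention mask interpolating causal and bidirectional masks so
# that later images can partially attend back to earlier ones.
import torch

def soft_attention_mask(seq_len: int, image_positions: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    causal = torch.tril(torch.ones(seq_len, seq_len))             # standard causal mask (1 = may attend)
    bidirectional = causal.clone()
    idx = image_positions                                         # 1D LongTensor of image-token indices
    bidirectional[idx.unsqueeze(1), idx.unsqueeze(0)] = 1.0       # image tokens attend to each other both ways
    return alpha * bidirectional + (1.0 - alpha) * causal         # soft mask: linear interpolation of the two
```

In practice the resulting soft mask would weight the attention scores (e.g., multiplicatively or as an additive log-mask) inside the LVLM's attention layers; the exact integration point is not specified here.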
https://arxiv.org/abs/2503.13792
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based 'RefineBot' updates them to remove shortcuts. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing that MicroVQA is a valuable resource for advancing AI-driven biomedical research. MicroVQA is available at this https URL, and the project page at this https URL.
https://arxiv.org/abs/2503.13399
Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models. We will publish our SFT dataset and benchmark.
https://arxiv.org/abs/2503.13111
We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To this end, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state-of-the-art on HIS-QA tasks. We hope this work inspires future research on human behavior analysis in 3D scenes, advancing embodied AI and world models.
https://arxiv.org/abs/2503.12955
With the rapid development of large multimodal models (LMMs), multimodal understanding applications are emerging. As most LMM inference requests originate from edge devices with limited computational capabilities, the predominant inference pipeline involves directly forwarding the input data to an edge server which handles all computations. However, this approach introduces high transmission latency due to limited uplink bandwidth of edge devices and significant computation latency caused by the prohibitive number of visual tokens, thus hindering delay-sensitive tasks and degrading user experience. To address this challenge, we propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework, where visual features are merged by clustering and encoded by a learnable and selective entropy model before feature projection. Specifically, we employ density peaks clustering based on K nearest neighbors to reduce the number of visual features, thereby minimizing both data transmission and computational complexity. Subsequently, a learnable entropy model with hyperprior is utilized to encode and decode merged features, further reducing transmission overhead. To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features, enabling a more accurate estimation of the probability distribution. Comprehensive experiments on seven visual question answering benchmarks validate the effectiveness of the proposed TOFC method. Results show that TOFC achieves up to 60% reduction in data transmission overhead and 50% reduction in system latency while maintaining identical task performance, compared with traditional image compression methods.
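A sketch of the token-reduction step, merging visual features with density-peaks clustering over K nearest neighbors (DPC-KNN); the entropy-model encoding stage is omitted and the parameter choices are illustrative.

```python
# Sketch: reduce N visual tokens to num_clusters merged tokens via DPC-KNN.
import torch

def dpc_knn_merge(feats: torch.Tensor, num_clusters: int, k: int = 5) -> torch.Tensor:
    """feats: (N, D) visual features -> (num_clusters, D) merged features."""
    dist = torch.cdist(feats, feats)                            # (N, N) pairwise distances
    knn_dist, _ = dist.topk(k + 1, dim=1, largest=False)        # k nearest neighbors (plus self at distance 0)
    density = (-knn_dist[:, 1:].pow(2).mean(dim=1)).exp()       # local density from KNN distances
    higher = density.unsqueeze(0) > density.unsqueeze(1)        # higher[i, j]: point j is denser than point i
    masked = dist.masked_fill(~higher, float("inf"))
    delta, _ = masked.min(dim=1)                                # distance to the nearest denser point
    delta[delta.isinf()] = dist.max()                           # the densest point gets the maximum distance
    centers = (density * delta).topk(num_clusters).indices      # centers: high density and far from denser points
    assign = dist[:, centers].argmin(dim=1)                     # assign every token to its closest center
    return torch.stack([feats[assign == c].mean(dim=0) for c in range(num_clusters)])
```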
https://arxiv.org/abs/2503.12926
Large language models (LLMs) have demonstrated enhanced performance through the Thinking-then-Responding paradigm, where models generate internal thoughts before final responses (aka System 2 thinking). However, existing research lacks a systematic understanding of the mechanisms underlying how thinking patterns affect performance across model sizes. In this work, we conduct a comprehensive analysis of the impact of various thinking types on model performance and introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response pairs (QA) collected from existing instruction-following datasets and covering five thinking types. For each pair, we augment it with five distinct internal thinking patterns: one unstructured thinking (monologue) and four structured variants (decomposition, self-ask, self-debate, and self-critic), while maintaining the same instruction and response. Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most structured thinking patterns, while for larger models (32B), structured thinking such as decomposition degrades performance; and (2) unstructured monologue demonstrates broad effectiveness across model sizes. Finally, we release all of our datasets, checkpoints, and training logs for the diverse thinking patterns to support reproducibility, aiming to facilitate further research in this direction.
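A minimal sketch of how one instruction-response pair might be augmented with a chosen thinking pattern while keeping the instruction and response unchanged; the tags, template, and `generate_thought` call are hypothetical, not the actual ThinkPatterns-21k construction code.

```python
# Sketch: wrap a fixed response with an internal thought of a given pattern.
THINKING_PATTERNS = ["monologue", "decomposition", "self_ask", "self_debate", "self_critic"]

def augment(instruction: str, response: str, pattern: str, generate_thought) -> dict:
    assert pattern in THINKING_PATTERNS
    thought = generate_thought(instruction, response, pattern)   # LLM call producing the internal thought
    target = f"<think>{thought}</think>\n{response}"             # thought precedes the unchanged response
    return {"instruction": instruction, "output": target, "pattern": pattern}
```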
https://arxiv.org/abs/2503.12918
We investigate complex video question answering via chain-of-evidence reasoning -- identifying sequences of temporal spans from multiple relevant parts of the video, together with visual evidence within them. Existing models struggle with multi-step reasoning as they uniformly sample a fixed number of frames, which can miss critical evidence distributed nonuniformly throughout the video. Moreover, they lack the ability to temporally localize such evidence in the broader context of the full video, which is required for answering complex questions. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains, automatically constructed by searching for optimal intervals of interest in the video with supporting evidence, that maximizes the likelihood of answering a given question. We train our model (VITED) to generate these evidence chains directly, enabling it to both localize evidence windows as well as perform multi-step reasoning across them in long-form video content. We show the value of our evidence-distilled models on a suite of long video QA benchmarks where we outperform state-of-the-art approaches that lack evidence reasoning capabilities.
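As a rough approximation of the chain-construction search described above, the sketch below greedily adds the candidate interval that most increases the likelihood of the correct answer; `answer_likelihood` is a hypothetical stand-in for a VideoQA model scoring P(answer | question, selected intervals).

```python
# Sketch: greedy construction of an evidence chain of temporal intervals.

def build_evidence_chain(question: str, answer: str, candidate_intervals: list,
                         answer_likelihood, max_links: int = 4) -> list:
    chain = []
    best = answer_likelihood(question, answer, chain)
    for _ in range(max_links):
        gains = [(answer_likelihood(question, answer, chain + [iv]), iv)
                 for iv in candidate_intervals if iv not in chain]
        if not gains:
            break
        score, interval = max(gains)
        if score <= best:            # stop once no interval improves the likelihood
            break
        chain.append(interval)
        best = score
    return chain
```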
https://arxiv.org/abs/2503.12855
Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV), with 7% of visually impaired individuals experiencing falls at least once a month. While recent advances in Multimodal Large Language Models (MLLMs) offer promising opportunities for BLV assistance, their development has been hindered by limited datasets. This limitation stems from the fact that BLV-aware annotation requires specialized domain knowledge and intensive labor. To address this gap, we introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs (including 2K human-annotated pairs) that capture diverse real-world scenes from a pedestrian's viewpoint. Our approach shifts the annotation burden from generation to verification through a collaborative human-AI framework grounded in established accessibility standards, significantly improving efficiency while maintaining high-quality annotations. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities, specifically object recognition and relative depth perception. Our experimental results highlight the importance of accurate spatial understanding for effective BLV guidance. GuideDog and GuideDogQA will advance research in MLLM-based assistive technologies for BLV individuals while contributing to broader applications in understanding egocentric scenes for robotics and augmented reality. The code and dataset will be publicly available.
https://arxiv.org/abs/2503.12844
Large Vision-Language Models (LVLMs) have achieved significant progress in combining visual comprehension with language generation. Despite this success, the training data of LVLMs still suffers from Long-Tail (LT) problems, where the data distribution is highly imbalanced. Previous works have mainly focused on traditional VLM architectures, i.e., CLIP or ViT, and specific tasks such as recognition and classification. Nevertheless, the exploration of LVLMs (e.g., LLaVA) and more general tasks (e.g., Visual Question Answering and Visual Reasoning) remains under-explored. In this paper, we first conduct an in-depth analysis of the LT issues in LVLMs and identify two core causes: the overrepresentation of head concepts and the underrepresentation of tail concepts. Based on the above observation, we propose an Adaptive Data Refinement Framework (ADR), which consists of two stages: Data Rebalancing (DR) and Data Synthesis (DS). In the DR stage, we adaptively rebalance the redundant data based on entity distributions, while in the DS stage, we leverage Denoising Diffusion Probabilistic Models (DDPMs) and scarce images to supplement underrepresented portions. Through comprehensive evaluations across eleven benchmarks, our proposed ADR effectively mitigates the long-tail problem in the training data, improving the average performance of LLaVA 1.5 by a relative 4.36%, without increasing the training data volume.
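A minimal sketch of entity-based rebalancing in the spirit of the DR stage: cap how often head entities appear by subsampling their instances. The quantile-based cap is an assumption for illustration; the paper's adaptive scheme is more involved.

```python
# Sketch: downsample training samples whose entities already appear too often.
import random
from collections import Counter

def rebalance(samples: list, cap_quantile: float = 0.9, seed: int = 0) -> list:
    """Each sample carries an 'entities' list; keep a sample only while all its entities are under the cap."""
    rng = random.Random(seed)
    counts = Counter(e for s in samples for e in s["entities"])
    cap = sorted(counts.values())[int(cap_quantile * (len(counts) - 1))]   # frequency cap from the quantile
    kept, seen = [], Counter()
    for s in rng.sample(samples, len(samples)):                            # shuffle to avoid order bias
        if all(seen[e] < cap for e in s["entities"]):
            kept.append(s)
            seen.update(s["entities"])
    return kept
```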
https://arxiv.org/abs/2503.12821