Uncertainty estimation is a significant issue for current large language models (LLMs) that are generally poorly calibrated and over-confident, especially with reinforcement learning from human feedback (RLHF). Unlike humans, whose decisions and confidences not only stem from intrinsic beliefs but can also be adjusted through daily observations, existing calibration methods for LLMs focus on estimating or eliciting individual confidence without taking full advantage of the "Collective Wisdom": the interaction among multiple LLMs that can collectively improve both accuracy and calibration. In this work, we propose Collaborative Calibration, a post-hoc training-free calibration strategy that leverages the collaborative and expressive capabilities of multiple tool-augmented LLM agents in a simulated group deliberation process. We demonstrate the effectiveness of Collaborative Calibration on generative QA tasks across various domains, showing its potential in harnessing the rationalization of collectively calibrated confidence assessments and improving the reliability of model predictions.
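As a rough illustration of the idea that inter-agent agreement can inform calibrated confidence, here is a minimal sketch (not the paper's tool-augmented deliberation protocol) that blends self-reported confidences with majority agreement; the answers and weights are placeholders.

```python
# Minimal sketch: agreement-based aggregation of answers and confidences from
# several agents. Illustration of the general idea only; the actual
# Collaborative Calibration procedure involves tool-augmented group deliberation.
from collections import Counter

def aggregate(agent_outputs):
    """agent_outputs: list of (answer, self_reported_confidence) pairs."""
    votes = Counter(answer for answer, _ in agent_outputs)
    answer, count = votes.most_common(1)[0]
    agreement = count / len(agent_outputs)                 # inter-agent agreement
    mean_conf = sum(c for a, c in agent_outputs if a == answer) / count
    return answer, 0.5 * agreement + 0.5 * mean_conf       # blended confidence

print(aggregate([("Paris", 0.9), ("Paris", 0.7), ("Lyon", 0.8)]))
```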
https://arxiv.org/abs/2404.09127
In the field of Question Answering (QA), unifying large language models (LLMs) with external databases has shown great success. However, these methods often fall short in providing the advanced reasoning needed for complex QA tasks. To address these issues, we improve upon a novel approach called Knowledge Graph Prompting (KGP), which combines knowledge graphs with an LLM-based agent to improve reasoning and search accuracy. Nevertheless, the original KGP framework necessitates costly fine-tuning with large datasets and still suffers from LLM hallucination. Therefore, we propose a reasoning-infused LLM agent to enhance this framework. This agent mimics human curiosity by asking follow-up questions to navigate the search more efficiently. This simple modification significantly boosts LLM performance on QA tasks without the high costs and latency associated with the initial KGP framework. Our ultimate goal is to further develop this approach, leading to more accurate, faster, and cost-effective solutions in the QA domain.
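A hypothetical sketch of the curiosity-driven navigation described above: at each hop the agent poses a follow-up question and uses it to pick the next passage in the document graph. The `llm` stub, the relevance score, and the toy graph are all stand-ins, not the paper's implementation.

```python
def llm(prompt):                         # stub: replace with a real chat-completion call
    return "look for the founding year of the company"

def score(passage, follow_up):           # crude lexical relevance, illustrative only
    return len(set(passage.lower().split()) & set(follow_up.lower().split()))

def navigate(graph, start, question, max_hops=3):
    path, node = [start], start
    for _ in range(max_hops):
        neighbors = graph.get(node, [])
        if not neighbors:
            break
        follow_up = llm(f"Question: {question}\nSeen: {node}\nWhat should we look for next?")
        node = max(neighbors, key=lambda p: score(p, follow_up))   # follow the follow-up
        path.append(node)
    return path

graph = {"intro passage": ["the company was founded in 1998", "weather report"]}
print(navigate(graph, "intro passage", "When was the company founded?"))
```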
https://arxiv.org/abs/2404.09077
In the realm of media technology, digital humans have gained prominence due to rapid advancements in computer technology. However, the manual modeling and control required for most digital humans pose significant obstacles to efficient development. Speech-driven methods offer a novel avenue for manipulating the mouth shape and expressions of digital humans. Despite the proliferation of driving methods, the quality of many generated talking head (TH) videos remains a concern, impacting user visual experiences. To tackle this issue, this paper introduces the Talking Head Quality Assessment (THQA) database, featuring 800 TH videos generated through 8 diverse speech-driven methods. Extensive experiments affirm the THQA database's richness in character and speech features. Subsequent subjective quality assessment experiments analyze correlations between scoring results and speech-driven methods, ages, and genders. In addition, experimental results show that mainstream image and video quality assessment methods have limitations for the THQA database, underscoring the need for further research to enhance TH video quality assessment. The THQA database is publicly accessible at this https URL.
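To make the "limitations of mainstream methods" claim concrete, a common way to quantify how well an objective metric tracks subjective opinion is rank and linear correlation (SROCC/PLCC); the score arrays below are invented placeholders, not THQA data.

```python
# Correlate objective metric outputs with subjective mean opinion scores (MOS).
from scipy.stats import spearmanr, pearsonr

mos        = [4.2, 3.1, 2.5, 4.8, 1.9]       # subjective scores (placeholders)
metric_out = [0.71, 0.55, 0.60, 0.83, 0.30]  # objective metric predictions (placeholders)

srocc, _ = spearmanr(mos, metric_out)
plcc, _  = pearsonr(mos, metric_out)
print(f"SROCC={srocc:.3f}  PLCC={plcc:.3f}")  # low values would indicate limitations
```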
https://arxiv.org/abs/2404.09003
In this paper, we focus on generating a synthetic question answering (QA) dataset using an adapted Translate-Align-Retrieve method. Using this method, we created the largest Serbian QA dataset of more than 87K samples, which we name SQuAD-sr. To acknowledge the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset. We investigate the dataset quality and use it to fine-tune several pre-trained QA models. Best results were obtained by fine-tuning the BERTić model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score on the benchmark XQuAD dataset, which we translated into Serbian for the purpose of evaluation. The results show that our model exceeds zero-shot baselines, but fails to go beyond human performance. We note the advantage of using a monolingual pre-trained model over multilingual, as well as the performance increase gained by using Latin over Cyrillic. By performing additional analysis, we show that questions about numeric values or dates are more likely to be answered correctly than other types of questions. Finally, we conclude that SQuAD-sr is of sufficient quality for fine-tuning a Serbian QA model, in the absence of a manually crafted and annotated dataset.
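For reference, the Exact Match and F1 figures quoted above are typically computed with the standard SQuAD-style recipe; a compact re-implementation is sketched below (the article stripping is English-specific and is kept only for parity with the usual evaluation script).

```python
import re, string
from collections import Counter

def normalize(s):
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)      # English-specific, kept for parity
    return " ".join(s.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Никола Тесла", "никола тесла"), f1("born in 1856", "1856"))
```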
https://arxiv.org/abs/2404.08617
Visual question answering (VQA) is known as an AI-complete task as it requires understanding, reasoning, and inferring about the vision and the language content. Over the past few years, numerous neural architectures have been suggested for the VQA problem. However, achieving success in zero-shot VQA remains a challenge due to its requirement for advanced generalization and reasoning skills. This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline. Specifically, we explore the efficacy of utilizing image captions instead of images and leveraging large language models (LLMs) to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across various question types in terms of structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using the keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the efficacy of using general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available at \url{this https URL}.
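An illustrative sketch of the question-driven captioning pipeline: keywords are extracted from the question, a keyword-conditioned caption is requested for the image, and the caption (rather than the image) is placed in the LLM prompt. `caption_model` and `llm` are hypothetical stand-ins, not the released code.

```python
STOPWORDS = {"what", "is", "the", "a", "an", "of", "in", "on", "this", "are"}

def keywords(question):
    return [w.strip("?.,").lower() for w in question.split()
            if w.strip("?.,").lower() not in STOPWORDS]

def caption_model(image, hint):         # stub for a keyword-conditioned captioner
    return f"a photo related to {', '.join(hint)}"

def llm(prompt):                        # stub for the zero-shot QA model
    return "red"

def question_driven_vqa(image, question):
    kw = keywords(question)
    caption = caption_model(image, hint=kw)
    prompt = (f"Caption: {caption}\nQuestion: {question}\n"
              f"Answer with a short phrase:")
    return llm(prompt)

print(question_driven_vqa("img_001.jpg", "What color is the car on the left?"))
```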
https://arxiv.org/abs/2404.08589
Effective ontology transfer has been a major goal of recent work on event argument extraction (EAE). Two methods in particular -- question answering (QA) and template infilling (TI) -- have emerged as promising approaches to this problem. However, detailed explorations of these techniques' ability to actually enable this transfer are lacking. In this work, we provide such a study, exploring zero-shot transfer using both techniques on six major EAE datasets at both the sentence and document levels. Further, we challenge the growing reliance on LLMs for zero-shot extraction, showing that vastly smaller models trained on an appropriate source ontology can yield zero-shot performance superior to that of GPT-3.5 or GPT-4.
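For intuition, the two transfer formulations being compared can be sketched as prompt shapes: argument extraction cast as QA versus as template infilling. The sentence, question, and template below are invented examples, not items from the six datasets.

```python
sentence = "The rebels attacked the convoy near Kabul on Tuesday."

# Question answering (QA) formulation: one question per argument role.
qa_prompt = (f"Context: {sentence}\n"
             f"Question: Who carried out the Attack event?\nAnswer:")

# Template infilling (TI) formulation: fill all roles of the event template at once.
ti_prompt = (f"Context: {sentence}\n"
             f"Template: <attacker> attacked <target> at <place> on <time>.\n"
             f"Fill in the template using spans from the context.")

print(qa_prompt, ti_prompt, sep="\n\n")
```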
https://arxiv.org/abs/2404.08579
In today's digital world, seeking answers to health questions on the Internet is a common practice. However, existing question answering (QA) systems often rely on pre-selected and annotated evidence documents, making them inadequate for addressing novel questions. Our study focuses on the open-domain QA setting, where the key challenge is to first uncover relevant evidence in large knowledge bases. Using the common retrieve-then-read QA pipeline and PubMed as a trustworthy collection of medical research documents, we answer health questions from three diverse datasets. We modify different retrieval settings to observe their influence on the QA pipeline's performance, including the number of retrieved documents, the sentence selection process, the publication year of articles, and their number of citations. Our results reveal that cutting down on the number of retrieved documents and favoring more recent and highly cited documents can improve the final macro F1 score by up to 10%. We discuss the results, highlight interesting examples, and outline challenges for future research, such as managing evidence disagreement and crafting user-friendly explanations.
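A schematic version of the retrieval-side knobs varied in the study: cap the number of retrieved documents and prefer recent, highly cited ones before reading. The document list and scoring weights are illustrative only, not the paper's configuration.

```python
docs = [
    {"pmid": "1", "year": 2012, "citations": 15,  "score": 0.81},
    {"pmid": "2", "year": 2021, "citations": 230, "score": 0.78},
    {"pmid": "3", "year": 2019, "citations": 40,  "score": 0.74},
]

def rerank(docs, k=2, w_recency=0.3, w_citations=0.2):
    def key(d):
        recency = (d["year"] - 2000) / 25            # crude normalisation, illustrative
        cites   = min(d["citations"], 500) / 500
        return d["score"] + w_recency * recency + w_citations * cites
    return sorted(docs, key=key, reverse=True)[:k]   # keep only k documents for reading

for d in rerank(docs):
    print(d["pmid"])
```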
https://arxiv.org/abs/2404.08359
Several previous studies have considered language- and domain-specific large language models (LLMs) as separate topics. This study explores the combination of a non-English language and a high-demand industry domain, focusing on a Japanese business-specific LLM. This type of model requires expertise in the business domain, strong language skills, and regular updates of its knowledge. We trained a 13-billion-parameter LLM from scratch using a new dataset of business texts and patents, and continually pretrained it with the latest business documents. Further, we propose a new benchmark for Japanese business domain question answering (QA) and evaluate our models on it. The results show that our pretrained model improves QA accuracy without losing general knowledge, and that continual pretraining enhances adaptation to new information. Our pretrained model and business domain benchmark are publicly available.
https://arxiv.org/abs/2404.08262
Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications. However, existing methods sometimes lead to the generation of hallucinated captions, compromising caption quality. This paper explores the issue of hallucination in 3D object captioning, with a focus on the Cap3D method, which renders 3D objects into 2D views for captioning using pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views, where views with high alignment closely represent the object's characteristics. By ranking all rendered views and feeding the top-ranked ones into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the correction of 200k captions in the Cap3D dataset and its extension to 1 million captions across the Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for a Visual Question Answering task, where it outperforms the CLIP model.
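A hedged sketch of the ranking step: score each rendered view with a (stubbed) alignment model, keep the top-ranked views, and caption only those. Function names are placeholders, not the released DiffuRank API.

```python
def alignment_score(obj_id, view):      # stub for the pre-trained text-to-3D scorer
    return {"front": 0.92, "top": 0.40, "back": 0.75}.get(view, 0.1)

def select_views(obj_id, views, top_k=2):
    ranked = sorted(views, key=lambda v: alignment_score(obj_id, v), reverse=True)
    return ranked[:top_k]               # only well-aligned views go to the captioner

def caption(obj_id, views):             # stub for a GPT4-Vision-style captioner
    return f"caption of {obj_id} from views {views}"

views = ["front", "top", "back", "bottom"]
print(caption("chair_0042", select_views("chair_0042", views)))
```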
https://arxiv.org/abs/2404.07984
Public Code Review (PCR) can be implemented through a Software Question Answering (SQA) community, which facilitates high knowledge dissemination. Current methods mainly focus on the reviewer's perspective, including finding a capable reviewer, predicting comment quality, and recommending/generating review comments. Our intuition is that satisfying review necessity requests can increase their visibility, which in turn is a prerequisite for better review responses. To this end, we propose a unified framework called UniPCR to complete developer-based request quality assurance (i.e., the subtasks of predicting request necessity and recommending tags) under a Masked Language Model (MLM). Specifically, we reformulate both subtasks via 1) text prompt tuning, which converts the two subtasks into MLM problems by constructing prompt templates with hard prompts; and 2) code prefix tuning, which optimizes a small segment of generated continuous vectors as the prefix of the code representation using soft prompts. Experimental results on the Public Code Review dataset for the time span 2011-2022 demonstrate that our UniPCR framework adapts to the two subtasks and outperforms state-of-the-art methods on accuracy for request quality assurance. These conclusions highlight the effectiveness of our unified framework from the developer's perspective in public code review.
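A rough sketch of the two ingredients named above, under simplifying assumptions: a hard text prompt that casts necessity prediction as mask filling, and a small block of continuous prefix vectors prepended to the code representation. The shapes, template wording, and random initialization are illustrative, not UniPCR's actual configuration.

```python
import numpy as np

def hard_prompt(review_request):
    # MLM-style template: the model fills [MASK] with e.g. "necessary"/"unnecessary".
    return f"Review request: {review_request} This request is [MASK]."

def with_code_prefix(code_embeddings, prefix_len=8, dim=768, rng=np.random.default_rng(0)):
    # Continuous prefix vectors; these would be trainable parameters in practice.
    prefix = rng.normal(size=(prefix_len, dim)).astype("float32")
    return np.concatenate([prefix, code_embeddings], axis=0)

print(hard_prompt("Please check the null handling in parse()"))
print(with_code_prefix(np.zeros((120, 768), dtype="float32")).shape)  # (128, 768)
```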
https://arxiv.org/abs/2404.07942
This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data (including textual design requirements, CAD images, and engineering drawings) derived from the Formula SAE student competition. Different from many existing MLLM benchmarks, DesignQA contains document-grounded visual questions where the input image and input document come from different sources. The benchmark features automatic evaluation metrics and is divided into segments (Rule Comprehension, Rule Compliance, and Rule Extraction) based on tasks that engineers perform when designing according to requirements. We evaluate state-of-the-art models like GPT4 and LLaVA against the benchmark, and our study uncovers the existing gaps in MLLMs' abilities to interpret complex engineering documentation. Key findings suggest that while MLLMs demonstrate potential in navigating technical documents, substantial limitations exist, particularly in accurately extracting and applying detailed requirements to engineering designs. This benchmark sets a foundation for future advancements in AI-supported engineering design processes. DesignQA is publicly available at: this https URL.
https://arxiv.org/abs/2404.07917
Unsupervised anomaly detection enables the identification of potential pathological areas by juxtaposing original images with their pseudo-healthy reconstructions generated by models trained exclusively on normal images. However, the clinical interpretation of resultant anomaly maps presents a challenge due to a lack of detailed, understandable explanations. Recent advancements in language models have shown the capability of mimicking human-like understanding and providing detailed descriptions. This raises an interesting question: \textit{How can language models be employed to make the anomaly maps more explainable?} To the best of our knowledge, we are the first to leverage a language model for unsupervised anomaly detection, for which we construct a dataset with different questions and answers. Additionally, we present a novel multi-image visual question answering framework tailored for anomaly detection, incorporating diverse feature fusion strategies to enhance visual knowledge extraction. Our experiments reveal that the framework, augmented by our new Knowledge Q-Former module, adeptly answers questions on the anomaly detection dataset. Besides, integrating anomaly maps as inputs distinctly aids in improving the detection of unseen pathologies.
https://arxiv.org/abs/2404.07622
While Large Language Models (LLMs) can achieve human-level performance in various tasks, they continue to face challenges when it comes to effectively tackling multi-step physics reasoning tasks. To identify the shortcomings of existing models and facilitate further research in this area, we curated a novel dataset, MM-PhyQA, which comprises well-constructed, high school-level multimodal physics problems. By evaluating the performance of contemporary publicly available LLMs, both with and without the incorporation of multimodal elements in these problems, we aim to shed light on their capabilities. For generating answers to questions consisting of multimodal input (in this case, images and text), we employed zero-shot prediction using GPT-4 and utilized LLaVA (LLaVA and LLaVA-1.5), the latter of which was fine-tuned on our dataset. For evaluating the performance of LLMs on textual input alone, we tested the base and fine-tuned versions of the Mistral-7B and LLaMA2-7b models. We also showcase the performance of the novel Multi-Image Chain-of-Thought (MI-CoT) prompting technique, which, when used to train LLaVA-1.5 13b, yielded the best results on our dataset, with superior scores in most metrics and the highest accuracy of 71.65% on the test set.
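A rough illustration of assembling a multi-image chain-of-thought prompt in the spirit of MI-CoT; the image placeholders, exemplar, and wording are invented for the sketch and are not taken from MM-PhyQA.

```python
def mi_cot_prompt(exemplar, images, question):
    image_tags = "\n".join(f"<image_{i}>" for i, _ in enumerate(images, 1))
    return (
        "Example:\n"
        f"{exemplar['images']}\nQ: {exemplar['question']}\n"
        f"Reasoning: {exemplar['reasoning']}\nA: {exemplar['answer']}\n\n"
        f"Now solve:\n{image_tags}\nQ: {question}\n"
        "Reason step by step, then give the final answer."
    )

exemplar = {"images": "<image_1>", "question": "What is the net force on the block?",
            "reasoning": "Weight is 20 N down, normal force 20 N up, so they cancel.",
            "answer": "0 N"}
print(mi_cot_prompt(exemplar, ["diagram.png", "graph.png"],
                    "What is the acceleration of the cart between t=2s and t=4s?"))
```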
https://arxiv.org/abs/2404.08704
Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
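One simple way to phrase image-space coordinates in instruction data is to quantize box corners into a fixed range and write them into the response text; the normalization scheme and template below are assumptions, not the paper's exact format.

```python
def box_to_text(box, width, height, bins=100):
    x1, y1, x2, y2 = box
    q = lambda v, size: int(round(v / size * (bins - 1)))   # quantize to [0, 99]
    return f"[{q(x1, width)}, {q(y1, height)}, {q(x2, width)}, {q(y2, height)}]"

instruction = "Where is the dog in the image?"
response = f"The dog is at {box_to_text((56, 120, 210, 300), width=640, height=480)}."
print(instruction, response, sep="\n")
```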
https://arxiv.org/abs/2404.07449
Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a more broad and contextualized visual understanding of VLMs.
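A toy sketch of the general recipe (features from multiple frozen encoders consolidated into one sequence for a frozen LM); BRAVE's bridge module is learned and more sophisticated, and the dimensions and random projections here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
feats_clip = rng.normal(size=(196, 768))    # encoder A: 196 tokens, dim 768 (placeholder)
feats_dino = rng.normal(size=(256, 1024))   # encoder B: 256 tokens, dim 1024 (placeholder)
lm_dim = 4096

def project(x, out_dim, rng):
    W = rng.normal(size=(x.shape[1], out_dim)) / np.sqrt(x.shape[1])
    return x @ W                             # learned in practice, random here

fused = np.concatenate([project(feats_clip, lm_dim, rng),
                        project(feats_dino, lm_dim, rng)], axis=0)
print(fused.shape)                           # (452, 4096): input sequence for the frozen LM
```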
https://arxiv.org/abs/2404.07204
Complex logical query answering (CLQA) in knowledge graphs (KGs) goes beyond simple KG completion and aims at answering compositional queries comprised of multiple projections and logical operations. Existing CLQA methods that learn parameters bound to certain entity or relation vocabularies can only be applied to the graph they are trained on which requires substantial training time before being deployed on a new graph. Here we present UltraQuery, an inductive reasoning model that can zero-shot answer logical queries on any KG. The core idea of UltraQuery is to derive both projections and logical operations as vocabulary-independent functions which generalize to new entities and relations in any KG. With the projection operation initialized from a pre-trained inductive KG reasoning model, UltraQuery can solve CLQA on any KG even if it is only finetuned on a single dataset. Experimenting on 23 datasets, UltraQuery in the zero-shot inference mode shows competitive or better query answering performance than best available baselines and sets a new state of the art on 14 of them.
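For intuition, vocabulary-independent logical operators can act directly on per-entity score vectors, for example via fuzzy-logic connectives; this generic formulation only illustrates the shape of the computation, not UltraQuery's exact parameterization.

```python
import numpy as np

def conj(a, b):   return a * b               # product t-norm
def disj(a, b):   return a + b - a * b       # probabilistic sum
def neg(a):       return 1.0 - a

# Scores over 5 candidate entities from two projection steps (placeholders).
p1 = np.array([0.9, 0.2, 0.7, 0.1, 0.5])
p2 = np.array([0.8, 0.6, 0.1, 0.1, 0.4])
query_scores = conj(p1, neg(p2))             # e.g. "matches p1 AND NOT p2"
print(query_scores.argmax(), query_scores.round(2))
```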
https://arxiv.org/abs/2404.07198
We present an empirical study of groundedness in long-form question answering (LFQA) by retrieval-augmented large language models (LLMs). In particular, we evaluate whether every generated sentence is grounded in the retrieved documents or the model's pre-training data. Across 3 datasets and 4 model families, our findings reveal that a significant fraction of generated sentences are consistently ungrounded, even when those sentences contain correct ground-truth answers. Additionally, we examine the impacts of factors such as model size, decoding strategy, and instruction tuning on groundedness. Our results show that while larger models tend to ground their outputs more effectively, a significant portion of correct answers remains compromised by hallucinations. This study provides novel insights into the groundedness challenges in LFQA and underscores the necessity for more robust mechanisms in LLMs to mitigate the generation of ungrounded content.
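A crude lexical-overlap proxy for the per-sentence groundedness check described above; the study itself relies on more careful attribution judgments, so treat this only as the shape of the computation.

```python
def grounded(sentence, documents, threshold=0.6):
    s_tokens = set(sentence.lower().split())
    best = max(len(s_tokens & set(d.lower().split())) / max(len(s_tokens), 1)
               for d in documents)
    return best >= threshold

answer_sentences = ["Insulin was discovered in 1921.",
                    "It is produced in the pancreas by beta cells."]
retrieved = ["Insulin was discovered in 1921 by Banting and Best."]
for s in answer_sentences:
    print(grounded(s, retrieved), "-", s)    # second sentence is unsupported here
```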
https://arxiv.org/abs/2404.07060
Recently, the area of adversarial attacks on image quality metrics has begun to be explored, whereas the area of defences remains under-researched. In this study, we aim to cover that gap and check the transferability of adversarial purification defences from image classifiers to IQA methods. In this paper, we apply several widespread attacks to IQA models and examine the success of defences against them. The purification methodologies cover different preprocessing techniques, including geometrical transformations, compression, denoising, and modern neural-network-based methods. We also address the challenge of assessing the efficacy of a defensive methodology by proposing ways to estimate output visual quality and the success of neutralizing attacks. Defences were tested against attacks on three IQA metrics -- Linearity, MetaIQA and SPAQ. The code for attacks and defences is available at: (link is hidden for a blind review).
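An illustrative purification preprocessing step of the kind benchmarked above (Gaussian blur plus a JPEG recompression round-trip) applied before handing an image to an IQA model; the IQA call is stubbed and the parameters are arbitrary.

```python
import io
from PIL import Image, ImageFilter

def purify(img, blur_radius=1.0, jpeg_quality=60):
    img = img.filter(ImageFilter.GaussianBlur(blur_radius))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=jpeg_quality)   # compression round-trip
    buf.seek(0)
    return Image.open(buf)

def iqa_model(img):          # stub standing in for Linearity / MetaIQA / SPAQ style metrics
    return 0.5

adversarial = Image.new("RGB", (64, 64), color=(120, 30, 30))   # placeholder input
print(iqa_model(purify(adversarial)))
```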
https://arxiv.org/abs/2404.06957
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
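A schematic of a decomposed, training-free pipeline in the spirit described above: three prompted stages writing into a shared memory. The prompts and canned outputs are placeholders, not the MoReVQA implementation.

```python
def prompt_llm(stage, inputs):           # stub for few-shot prompting of a large model
    canned = {"parse": ["person picks up cup", "person drinks"],
              "ground": {"person picks up cup": (2.0, 3.5)},
              "reason": "the person drinks from the cup"}
    return canned[stage]

def staged_videoqa(video, question):
    memory = {"question": question}                                   # external memory
    memory["events"]     = prompt_llm("parse",  {"question": question})
    memory["groundings"] = prompt_llm("ground", {"video": video, "events": memory["events"]})
    return prompt_llm("reason", memory)   # final answer from interpretable intermediates

print(staged_videoqa("clip_17.mp4", "What does the person do after picking up the cup?"))
```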
https://arxiv.org/abs/2404.06511
Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets include test samples of varying lengths (from 2k to 32k+) entangled together, making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultralong settings (100k+ tokens) that the latest LLMs claim to achieve. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities. These benchmarks support intricate manipulation of the length of test cases, and can easily produce text samples up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at this https URL.
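A sketch of how a length-adaptable test case can be assembled: accumulate text segments until a target length is reached, shuffle them, and ask the model to restore the order (a TSort-style task). The word-count token proxy and construction details are simplifications of what Ada-LEval actually does.

```python
import random

def build_tsort_case(segments, target_tokens, seed=0):
    chosen, total = [], 0
    for seg in segments:
        if total >= target_tokens:
            break
        chosen.append(seg)
        total += len(seg.split())          # crude word-count proxy for tokens
    order = list(range(len(chosen)))
    random.Random(seed).shuffle(order)
    shuffled = [chosen[i] for i in order]
    answer = sorted(range(len(order)), key=order.__getitem__)  # permutation that restores order
    return shuffled, answer

segments = [f"Paragraph {i}: " + "lorem ipsum " * 50 for i in range(200)]
case, gold = build_tsort_case(segments, target_tokens=1000)
print(len(case), gold[:5])
```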
https://arxiv.org/abs/2404.06480