Large Language Models (LLMs) are pre-trained on large-scale corpora and excel in numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, when it comes to domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations, knowledge cut-offs, and a lack of knowledge attribution. Additionally, fine-tuning LLMs' intrinsic knowledge for highly specific domains is an expensive and time-consuming process. Retrieval-augmented generation (RAG) has recently emerged as a method for optimizing LLM responses by grounding them in a predetermined ontology. Using a Knowledge Graph (KG) ontology for RAG has been shown to improve QA accuracy by taking into account relevant sub-graphs that preserve information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework that integrates RAG with a KG and a vector store (VS) that store factual domain-specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific (i) KG (containing structured information) and (ii) VS (containing unstructured information) enables the development of domain-specific chatbots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question-answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents. The framework is designed to generalize to any specific or specialized domain. In this paper, we demonstrate the question-answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection.
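The retrieval step described above can be pictured as querying both stores and assembling their results into a grounded, attributable prompt. Below is a minimal sketch of that idea; the `kg.subgraph(...)`, `vs.search(...)`, and `llm.generate(...)` helpers are hypothetical stand-ins, not the SMART-SLIC API.

```python
# Minimal RAG sketch combining a knowledge graph (structured facts) and a
# vector store (unstructured passages). All helper objects are hypothetical.
from dataclasses import dataclass

@dataclass
class Document:
    source: str   # e.g. a DOI or file path, kept for attribution
    text: str

def build_prompt(question: str, kg_triples: list[tuple[str, str, str]],
                 docs: list[Document]) -> str:
    """Assemble a grounded prompt; the LLM is asked to cite its sources."""
    kg_block = "\n".join(f"({s}, {r}, {o})" for s, r, o in kg_triples)
    doc_block = "\n\n".join(f"[{d.source}] {d.text}" for d in docs)
    return (
        "Answer using only the facts below and cite sources in brackets.\n\n"
        f"Structured facts:\n{kg_block}\n\n"
        f"Passages:\n{doc_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

def answer(question, kg, vs, llm, k=5):
    triples = kg.subgraph(question)        # relevant sub-graph (structured)
    docs = vs.search(question, top_k=k)    # nearest passages (unstructured)
    return llm.generate(build_prompt(question, triples, docs))
```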
https://arxiv.org/abs/2410.02721
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
https://arxiv.org/abs/2410.02713
Recent progress in generative models has stimulated significant innovations in many fields, such as image generation and chatbots. Despite their success, these models often produce sketchy and misleading solutions for complex multi-agent decision-making problems because they lack the trial-and-error experience and reasoning of humans. To address this limitation, we explore a paradigm that integrates a language-guided simulator into the multi-agent reinforcement learning pipeline to enhance the generated answer. The simulator is a world model that separately learns dynamics and reward, where the dynamics model comprises an image tokenizer as well as a causal transformer to generate interaction transitions autoregressively, and the reward model is a bidirectional transformer learned by maximizing the likelihood of trajectories in the expert demonstrations under language guidance. Given an image of the current state and the task description, we use the world model to train the joint policy and produce the image sequence as the answer by running the converged policy on the dynamics model. The empirical results demonstrate that this framework can improve the answers for multi-agent decision-making problems by showing superior performance on the training and unseen tasks of the StarCraft Multi-Agent Challenge benchmark. In particular, it can generate consistent interaction sequences and explainable reward functions at interaction states, opening the path for training generative models of the future.
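At a high level, producing the "answer" amounts to rolling the converged joint policy forward inside the learned dynamics model and scoring each transition with the language-guided reward model. A rough sketch under assumed interfaces; the tokenizer, dynamics, reward_model, and policy objects are placeholders, not the paper's code.

```python
# Illustrative rollout inside a learned world model; every component here
# (tokenizer, dynamics, reward_model, policy) is a stand-in for the real modules.
def rollout(initial_image, task_text, tokenizer, dynamics, reward_model,
            policy, horizon=32):
    frames, rewards = [initial_image], []
    state_tokens = tokenizer.encode(initial_image)        # image -> discrete tokens
    for _ in range(horizon):
        actions = policy.act(state_tokens, task_text)      # joint action for all agents
        # causal transformer predicts next-state tokens autoregressively
        state_tokens = dynamics.next_tokens(state_tokens, actions, task_text)
        frame = tokenizer.decode(state_tokens)             # tokens -> image
        # bidirectional transformer scores the transition under language guidance
        rewards.append(reward_model.score(frames[-1], actions, frame, task_text))
        frames.append(frame)
    return frames, rewards   # the image sequence is returned as the "answer"
```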
https://arxiv.org/abs/2410.02664
LLMs are increasingly being used in workflows involving generating content to be consumed by humans (e.g., marketing) and also in directly interacting with humans (e.g., through chatbots). The development of such systems that are capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could positively impact domains like advertising and social good, such as addressing drug addiction, and on the other, they could be misused for spreading misinformation and shaping political opinions. To channel LLMs' impact on society, we need to develop systems to measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first large-scale benchmark and arena containing a battery of tasks to measure the persuasion ability of generative models automatically. We investigate to what extent LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models. Notably, targeted training using synthetic and natural datasets significantly enhances smaller models' persuasive capabilities, challenging scale-dependent assumptions. Our findings carry key implications for both model developers and policymakers. For instance, while the EU AI Act and California's SB-1047 aim to regulate AI models based on the number of floating point operations, we demonstrate that simple metrics like this alone fail to capture the full scope of AI's societal impact. We invite the community to explore and contribute to PersuasionArena and PersuasionBench, available at this https URL, to advance our understanding of AI-driven persuasion and its societal implications.
https://arxiv.org/abs/2410.02653
Image quality assessment (IQA) serves as the gold standard for evaluating model performance in nearly all computer vision fields. However, it still suffers from poor out-of-distribution generalization ability and expensive training costs. To address these problems, we propose Dog-IQA, a standard-guided zero-shot mix-grained IQA method, which is training-free and utilizes the exceptional prior knowledge of multimodal large language models (MLLMs). To obtain accurate IQA scores, namely scores consistent with humans, we design an MLLM-based inference pipeline that imitates human experts. In detail, Dog-IQA applies two techniques. First, Dog-IQA objectively scores with specific standards that utilize MLLM's behavior pattern and minimize the influence of subjective factors. Second, Dog-IQA comprehensively takes local semantic objects and the whole image as input and aggregates their scores, leveraging local and global information. Our proposed Dog-IQA achieves state-of-the-art (SOTA) performance compared with training-free methods, and competitive performance compared with training-based methods in cross-dataset scenarios. Our code and models will be available at this https URL.
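The mix-grained scoring can be thought of as one MLLM call per segmented object plus one call for the full image, with the final score being a weighted aggregate of the two granularities. A sketch under assumed interfaces; `segment_objects`, `mllm_score`, and the 0.5 global weight are illustrative choices, not the paper's exact design.

```python
# Hedged sketch of mix-grained IQA aggregation: local object scores plus a
# global score. mllm_score() and segment_objects() stand in for the MLLM
# prompt and the object segmenter used upstream.
def mix_grained_score(image, mllm_score, segment_objects, global_weight=0.5):
    objects = segment_objects(image)                       # local semantic crops
    local = [mllm_score(crop, standard="1-5 rubric") for crop in objects]
    global_ = mllm_score(image, standard="1-5 rubric")
    if not local:                                          # no objects found: use global only
        return global_
    local_mean = sum(local) / len(local)
    return global_weight * global_ + (1 - global_weight) * local_mean
```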
https://arxiv.org/abs/2410.02505
A standard way to evaluate the abilities of LLMs involves presenting a multiple-choice question and selecting the option with the highest logit as the model's predicted answer. However, such a format for evaluating LLMs has limitations, since even if the model knows the correct answer, it may struggle to select the corresponding letter simply due to difficulties in following this rigid format. To address this, we introduce new scores that better capture and reveal a model's underlying knowledge: the Query-Key Score (QK-score), derived from the interaction between query and key representations in attention heads, and the Attention Score, based on attention weights. These scores are extracted from specific select-and-copy heads, which show consistent performance across popular Multiple-Choice Question Answering (MCQA) datasets. Based on these scores, our method improves knowledge extraction, yielding up to a 16% gain for LLaMA2-7B and up to 10% for larger models on popular MCQA benchmarks. At the same time, accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60%, achieving nearly perfect accuracy and thereby demonstrating the method's efficiency in mitigating MCQA format limitations. To support our claims, we conduct experiments on models ranging from 7 billion to 70 billion parameters in both zero- and few-shot setups.
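One way to picture the QK-score: for a chosen select-and-copy head, dot the query vector at the final decision position with the key vectors at the option-letter positions, and pick the option with the largest product. A minimal torch sketch with toy tensors; the head choice and position bookkeeping here are assumptions about the mechanism, not the authors' code.

```python
import torch

def qk_score_options(q_proj, k_proj, hidden, option_positions, last_pos=-1):
    """Score answer options with one attention head's query-key interaction.

    q_proj, k_proj: (d_model, d_head) projection matrices of the chosen head.
    hidden: (seq_len, d_model) hidden states feeding that attention layer.
    option_positions: token index of each option letter (A, B, C, D).
    Returns the index of the highest-scoring option.
    """
    q = hidden[last_pos] @ q_proj                 # query at the decision token
    ks = hidden[option_positions] @ k_proj        # keys at the option letters
    scores = ks @ q / q.shape[-1] ** 0.5          # scaled dot products
    return int(torch.argmax(scores))

# toy usage with random tensors
d_model, d_head, seq_len = 64, 16, 40
hidden = torch.randn(seq_len, d_model)
q_proj, k_proj = torch.randn(d_model, d_head), torch.randn(d_model, d_head)
print(qk_score_options(q_proj, k_proj, hidden, option_positions=[10, 18, 26, 34]))
```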
https://arxiv.org/abs/2410.02343
Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that utilizes a large corpus of synthesized code-preference pairs from publicly available high-quality source code. CodePMP improves RM finetuning efficiency by pretraining preference models on large-scale synthesized code-preference pairs. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0), consistently showing significant improvements in reasoning performance of LLMs and highlighting the importance of scalable preference model pretraining for efficient reward modeling.
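Preference-model pretraining of this kind typically reduces to a Bradley-Terry style pairwise loss over (chosen, rejected) responses; the sketch below shows that standard objective, not code released by the authors.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the scalar reward of the chosen (better)
    code/response above that of the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy batch of scalar rewards produced by a reward-model head
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.1, 0.5, -0.2])
print(pairwise_preference_loss(chosen, rejected))   # lower is better
```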
https://arxiv.org/abs/2410.02229
Despite remarkable successes in unimodal learning tasks, backdoor attacks against cross-modal learning are still underexplored due to the limited generalization and inferior stealthiness when involving multiple modalities. Notably, since works in this area mainly inherit ideas from unimodal visual attacks, they struggle with dealing with diverse cross-modal attack circumstances and manipulating imperceptible trigger samples, which hinders their practicability in real-world applications. In this paper, we introduce a novel bilateral backdoor to fill in the missing pieces of the puzzle in the cross-modal backdoor and propose a generalized invisible backdoor framework against cross-modal learning (BadCM). Specifically, a cross-modal mining scheme is developed to capture the modality-invariant components as target poisoning areas, where well-designed trigger patterns injected into these regions can be efficiently recognized by the victim models. This strategy is adapted to different image-text cross-modal models, making our framework available to various attack scenarios. Furthermore, for generating poisoned samples of high stealthiness, we conceive modality-specific generators for visual and linguistic modalities that facilitate hiding explicit trigger patterns in modality-invariant regions. To the best of our knowledge, BadCM is the first invisible backdoor method deliberately designed for diverse cross-modal attacks within one unified framework. Comprehensive experimental evaluations on two typical applications, i.e., cross-modal retrieval and VQA, demonstrate the effectiveness and generalization of our method under multiple kinds of attack scenarios. Moreover, we show that BadCM can robustly evade existing backdoor defenses. Our code is available at this https URL.
https://arxiv.org/abs/2410.02182
Deploying large language models in production requires simultaneous attention to efficiency and risk control. Prior work has shown the possibility of cutting costs while maintaining similar accuracy, but has neglected risk control. By contrast, here we present hierarchical chains with multi-level abstention (HCMA), which use model-intrinsic uncertainty to delegate queries along the LLM intelligence hierarchy, enabling training-free model switching based solely on black-box API calls. Our framework presents novel trade-offs between efficiency and risk. For example, deploying HCMA on MMLU cuts the error rate of Llama3 405B by 30% when the model is allowed to abstain on 20% of the queries. To calibrate HCMA for optimal performance, our approach uses data-efficient logistic regressions (based on a simple nonlinear feature transformation), which require only 50 or 100 labeled examples to achieve excellent expected calibration error (ECE), cutting ECE by 50% compared to naive Platt scaling. On free-form generation tasks, we find that chain-of-thought is ineffectual for selective prediction, whereas zero-shot prompting drives error to 0% on TruthfulQA at high abstention rates. As LLMs are increasingly deployed across computing environments with different capabilities (such as mobile, laptop, and cloud), our framework paves the way towards maintaining deployment efficiency while putting sharp risk controls in place.
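The delegation logic can be pictured as comparing each level's calibrated probability of being correct against a threshold, escalating to the next model when confidence is too low and abstaining after the last level. A hedged sketch; the log-odds feature transform below is one plausible reading of the "simple nonlinear feature transformation," not necessarily the paper's exact choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

EPS = 1e-6

def _logit(p: float) -> float:
    """Nonlinear feature: log-odds of the model's raw confidence."""
    p = min(max(p, EPS), 1 - EPS)
    return float(np.log(p / (1 - p)))

def calibrate(raw_conf, correct) -> LogisticRegression:
    """Fit a data-efficient calibrator; ~50-100 labeled examples per the paper."""
    feats = np.array([_logit(p) for p in raw_conf]).reshape(-1, 1)
    return LogisticRegression().fit(feats, correct)

def cascade_answer(query, levels, thresholds):
    """levels: (model, calibrator) pairs from cheapest to most capable;
    thresholds: per-level acceptance cutoffs on calibrated P(correct)."""
    for (model, calibrator), tau in zip(levels, thresholds):
        answer, raw_conf = model(query)                    # black-box API call
        p_correct = calibrator.predict_proba([[_logit(raw_conf)]])[0, 1]
        if p_correct >= tau:
            return answer                                  # confident enough: stop here
    return None                                            # abstain after the largest model
```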
https://arxiv.org/abs/2410.02173
Long-Form Question Answering (LFQA) refers to generating in-depth, paragraph-level responses to open-ended questions. Although many LFQA methods have been developed, evaluating LFQA effectively and efficiently remains challenging due to its high complexity and cost, and there is still no standard benchmark for LFQA evaluation. To address this gap, we make a first attempt by proposing a well-constructed, reference-based benchmark named Chinese exAmination for LFQA Evaluation (CALF), aiming to rigorously assess the performance of automatic evaluation metrics for LFQA. The CALF benchmark is derived from Chinese examination questions that have been translated into English. It includes up to 1,476 examples consisting of knowledge-intensive and nuanced responses. Our evaluation comprises three different settings to analyze the behavior of automatic metrics comprehensively. We conducted extensive experiments on 7 traditional evaluation metrics, 3 prompt-based metrics, and 3 trained evaluation metrics, and also tested agent systems for LFQA evaluation. The results reveal that none of the current automatic evaluation metrics performs comparably with humans, indicating that they cannot capture the dense information contained in long-form responses well. In addition, we provide a detailed analysis of the reasons why automatic evaluation metrics fail when evaluating LFQA, offering valuable insights to advance LFQA evaluation systems. The dataset and associated code can be accessed at our GitHub repository.
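Meta-evaluating a metric on such a benchmark usually means correlating its scores with human judgments across examples; the sketch below shows that standard procedure with scipy and is illustrative rather than the CALF evaluation code.

```python
from scipy.stats import pearsonr, spearmanr

def meta_evaluate(metric_scores, human_scores):
    """Correlate an automatic metric with human judgments over the benchmark."""
    pearson, _ = pearsonr(metric_scores, human_scores)
    spearman, _ = spearmanr(metric_scores, human_scores)
    return {"pearson": pearson, "spearman": spearman}

# toy example: a metric that only loosely tracks human scores
print(meta_evaluate([0.2, 0.9, 0.4, 0.7], [1, 5, 3, 2]))
```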
https://arxiv.org/abs/2410.01945
Reward Models (RMs) play a crucial role in aligning LLMs with human preferences, enhancing their performance by ranking outputs during inference or iterative training. However, the degree to which an RM generalizes to new tasks is often not known a priori (e.g. some RMs may excel at scoring creative writing vs. math reasoning). Therefore, using only one fixed RM while training LLMs can be suboptimal. Moreover, optimizing LLMs with multiple RMs simultaneously can be prohibitively computationally-intensive and challenging due to conflicting signals from different RMs, potentially degrading performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which iteratively trains LLMs using multiple RMs, selecting and utilizing the most well-suited RM for each instance to rank outputs and generate preference data, framed as a multi-armed bandit problem. Our results on commonsense and math reasoning tasks demonstrate that LASeR can boost iterative LLM optimization by optimizing for multiple RMs, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over training with ensemble RM scores while also showing superior training efficiency (e.g., a 2x speedup). Moreover, on WildChat, a benchmark of instruction-following prompts, we find that using Llama-3-8B LASeR leads to a 71.45% AlpacaEval win rate over sequentially optimizing multiple RMs. Extending to long-context generation tasks, we find that on Llama-3-8B, LASeR achieves an average improvement of 2.64 F1 and 2.42 F1 on single- and multi-document QA over random RM selection when used with best-of-n sampling. LASeR is robust to noisy rewards and generalizes to multiple settings. Finally, LASeR's RM selection changes depending on the underlying task or instance and we verify the presence of conflicting preferences from multiple RMs that can be mitigated using LASeR.
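Framing RM selection as a multi-armed bandit can be illustrated with plain UCB1: pick the reward model whose estimated utility plus exploration bonus is highest, observe a utility signal (e.g., downstream accuracy gain), and update the estimate. UCB1 here is a simplified stand-in for whichever bandit algorithm LASeR actually uses.

```python
import math

class RMBandit:
    """UCB1 over a set of reward models (a simplified stand-in for LASeR)."""

    def __init__(self, num_rms: int):
        self.counts = [0] * num_rms
        self.values = [0.0] * num_rms
        self.t = 0

    def select(self) -> int:
        self.t += 1
        for i, c in enumerate(self.counts):      # try every arm once first
            if c == 0:
                return i
        ucb = [v + math.sqrt(2 * math.log(self.t) / c)
               for v, c in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm: int, utility: float) -> None:
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (utility - self.values[arm]) / n   # running mean

# usage: pick an RM for a batch, train on its preference data, feed back utility
bandit = RMBandit(num_rms=3)
arm = bandit.select()
bandit.update(arm, utility=0.12)   # e.g., observed accuracy gain
```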
https://arxiv.org/abs/2410.01735
The various limitations of Generative AI, such as hallucinations and model failures, have made it crucial to understand the role of different modalities in Visual Language Model (VLM) predictions. Our work investigates how the integration of information from image and text modalities influences the performance and behavior of VLMs in visual question answering (VQA) and reasoning tasks. We measure this effect through answer accuracy, reasoning quality, model uncertainty, and modality relevance. We study the interplay between text and image modalities in different configurations where visual content is essential for solving the VQA task. Our contributions include (1) the Semantic Interventions (SI)-VQA dataset, (2) a benchmark study of various VLM architectures under different modality configurations, and (3) the Interactive Semantic Interventions (ISI) tool. The SI-VQA dataset serves as the foundation for the benchmark, while the ISI tool provides an interface to test and apply semantic interventions in image and text inputs, enabling more fine-grained analysis. Our results show that complementary information between modalities improves answer and reasoning quality, while contradictory information harms model performance and confidence. Image text annotations have minimal impact on accuracy and uncertainty, slightly increasing image relevance. Attention analysis confirms the dominant role of image inputs over text in VQA tasks. In this study, we evaluate state-of-the-art VLMs that allow us to extract attention coefficients for each modality. A key finding is PaliGemma's harmful overconfidence, which poses a higher risk of silent failures compared to the LLaVA models. This work sets the foundation for rigorous analysis of modality integration, supported by datasets specifically designed for this purpose.
https://arxiv.org/abs/2410.01690
Self-consistency-based approaches, which involve repeatedly sampling multiple outputs and selecting the most consistent one as the final response, prove to be remarkably effective in improving the factual accuracy of large language models. Nonetheless, existing methods usually have strict constraints on the task format, largely limiting their applicability. In this paper, we present Integrative Decoding (ID), to unlock the potential of self-consistency in open-ended generation tasks. ID operates by constructing a set of inputs, each prepended with a previously sampled response, and then processes them concurrently, with the next token being selected by aggregating all of their corresponding predictions at each decoding step. In essence, this simple approach implicitly incorporates self-consistency in the decoding objective. Extensive evaluation shows that ID consistently enhances factuality over a wide range of language models, with substantial improvements on the TruthfulQA (+11.2%), Biographies (+15.4%) and LongFact (+8.5%) benchmarks. The performance gains amplify progressively as the number of sampled responses increases, indicating the potential of ID to scale up with repeated sampling.
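Concretely, each decoding step runs the model on several inputs in parallel, each one a previously sampled draft prepended to the prompt, and picks the token that maximizes the aggregated next-token distribution. A hedged sketch against Hugging Face-style interfaces; averaging log-probabilities is one reasonable aggregation and may differ from the paper's exact rule.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def integrative_decode(model, tokenizer, prompt, drafts, max_new_tokens=64):
    """Greedy decoding where each step aggregates next-token predictions over
    all parallel inputs (one previously sampled draft + the prompt each)."""
    tokenizer.padding_side = "left"          # position -1 is the newest token in every row
    if tokenizer.pad_token is None:          # many causal LM tokenizers need a pad token
        tokenizer.pad_token = tokenizer.eos_token
    enc = tokenizer([d + "\n" + prompt for d in drafts],
                    return_tensors="pt", padding=True)
    input_ids, attn = enc["input_ids"], enc["attention_mask"]
    generated = []
    for _ in range(max_new_tokens):
        logits = model(input_ids=input_ids, attention_mask=attn).logits[:, -1, :]
        avg_logprobs = F.log_softmax(logits, dim=-1).mean(dim=0)   # aggregate over drafts
        next_id = int(avg_logprobs.argmax())
        if next_id == tokenizer.eos_token_id:
            break
        generated.append(next_id)
        col = torch.full((input_ids.size(0), 1), next_id, dtype=input_ids.dtype)
        input_ids = torch.cat([input_ids, col], dim=1)             # same token for every row
        attn = torch.cat([attn, torch.ones_like(col)], dim=1)
    return tokenizer.decode(generated)
```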
https://arxiv.org/abs/2410.01556
Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced clinical skills (CS), yet current benchmarks fail to evaluate these comprehensively. We introduce MedQA-CS, an AI-SCE framework inspired by medical education's Objective Structured Clinical Examinations (OSCEs), to address this gap. MedQA-CS evaluates LLMs through two instruction-following tasks, LLM-as-medical-student and LLM-as-CS-examiner, designed to reflect real clinical scenarios. Our contributions include developing MedQA-CS, a comprehensive evaluation framework with publicly available data and expert annotations, and providing the quantitative and qualitative assessment of LLMs as reliable judges in CS evaluation. Our experiments show that MedQA-CS is a more challenging benchmark for evaluating clinical skills than traditional multiple-choice QA benchmarks (e.g., MedQA). Combined with existing benchmarks, MedQA-CS enables a more comprehensive evaluation of LLMs' clinical capabilities for both open- and closed-source LLMs.
https://arxiv.org/abs/2410.01553
While closed-source Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities, open-source models continue to struggle with such tasks. To bridge this gap, we propose a data augmentation approach and introduce PersonaMathQA, a dataset derived from MATH and GSM8K, on which we train the PersonaMath models. Our approach consists of two stages: the first stage is learning from Persona Diversification, and the second stage is learning from Reflection. In the first stage, we regenerate detailed chain-of-thought (CoT) solutions as instructions using a closed-source LLM and introduce a novel persona-driven data augmentation technique to enhance the dataset's quantity and diversity. In the second stage, we incorporate reflection to fully leverage more challenging and valuable questions. Evaluation of our PersonaMath models on MATH and GSM8K reveals that the PersonaMath-7B model (based on LLaMA-2-7B) achieves an accuracy of 24.2% on MATH and 68.7% on GSM8K, surpassing all baseline methods and achieving state-of-the-art performance. Notably, our dataset contains only 70.3K data points (merely 17.8% of MetaMathQA and 27% of MathInstruct), yet our model outperforms these baselines, demonstrating the high quality and diversity of our dataset, which enables more efficient model training. We open-source the PersonaMathQA dataset, PersonaMath models, and our code for public usage.
https://arxiv.org/abs/2410.01504
Recent advancements in Large Language Models (LLMs) have achieved robust performance across diverse tasks, but fine-tuning these models for specific domains remains resource-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) address this challenge by fine-tuning a small subset of parameters. However, existing methods for fusing multiple LoRAs lack dynamic fusion based on contextual inputs and often increase inference time due to token-level operations. We propose DLP-LoRA, a Dynamic Lightweight Plugin that employs a mini-MLP module with only 5M parameters to dynamically fuse multiple LoRAs at the sentence level using top-p sampling strategies. This approach reduces inference time to less than twice that of single-LoRA inference by leveraging parallel computation. Evaluations across 26 tasks (including multiple-choice questions and question answering) demonstrate that DLP-LoRA achieves an average accuracy of 92.34% on multiple-choice datasets and significant improvements in BLEU and ROUGE scores on QA datasets, outperforming different LLM backbones under composite task settings. DLP-LoRA effectively balances performance and efficiency, making it a practical solution for dynamic multi-task adaptation in LLMs. Our code is available at this https URL.
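The routing mechanism can be pictured as: embed the input sentence once, let a tiny MLP produce a distribution over the available LoRA adapters, keep the top-p mass, and fuse the selected adapters with renormalized weights. A torch sketch of that idea with assumed shapes; it mirrors the described mechanism, not the released code.

```python
import torch
import torch.nn as nn

class LoRARouter(nn.Module):
    """Tiny MLP that assigns sentence-level weights to N LoRA adapters."""

    def __init__(self, d_model: int, num_loras: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_loras))

    def forward(self, sentence_emb: torch.Tensor, top_p: float = 0.9):
        probs = torch.softmax(self.mlp(sentence_emb), dim=-1)
        sorted_p, idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_p, dim=-1) - sorted_p < top_p   # nucleus cut
        weights = sorted_p * keep
        return idx[keep], weights[keep] / weights.sum()            # adapters + fused weights

# usage: route one sentence embedding to its active adapters
router = LoRARouter(d_model=768, num_loras=8)
active, w = router(torch.randn(768))
# downstream (conceptually): hidden += sum(w[i] * lora[active[i]](hidden) for i in range(len(active)))
```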
https://arxiv.org/abs/2410.01497
Knowledge graph question answering (KGQA) involves answering natural language questions by leveraging structured information stored in a knowledge graph. Typically, KGQA systems first retrieve a targeted subgraph from a large-scale knowledge graph, which serves as the basis for reasoning models to address queries. However, the retrieved subgraph inevitably introduces distracting information for knowledge utilization, impeding the model's ability to perform accurate reasoning. To address this issue, we propose a Question-guided Knowledge Graph Re-scoring method (Q-KGR) to eliminate noisy pathways for the input question, thereby focusing specifically on pertinent factual knowledge. Moreover, we introduce Knowformer, a parameter-efficient method for injecting the re-scored knowledge graph into large language models to enhance their ability to perform factual reasoning. Extensive experiments on multiple KGQA benchmarks demonstrate the superiority of our method over existing systems.
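Edge re-scoring can be pictured as embedding the question and each edge's textual form, then pruning or re-weighting edges by their similarity to the question. A sketch with a generic embedding callable as a stand-in; the 0.3 threshold is illustrative, not the paper's setting.

```python
import numpy as np

def rescore_edges(question_vec: np.ndarray,
                  edges: list[tuple[str, str, str]],
                  embed, threshold: float = 0.3):
    """Keep and re-weight KG edges by cosine similarity to the question.

    embed: callable mapping text -> 1-D numpy vector (any sentence encoder).
    Returns (edge, score) pairs sorted by relevance, for the downstream reasoner.
    """
    q = question_vec / np.linalg.norm(question_vec)
    kept = []
    for s, r, o in edges:
        e = embed(f"{s} {r} {o}")
        score = float(q @ (e / np.linalg.norm(e)))
        if score >= threshold:                      # drop noisy pathways
            kept.append(((s, r, o), score))
    return sorted(kept, key=lambda x: -x[1])
```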
https://arxiv.org/abs/2410.01401
Learning is a key motivator behind information search behavior. With the emergence of LLM-based chatbots, students are increasingly turning to these tools as their primary resource for acquiring knowledge. However, the transition from traditional resources like textbooks and web searches raises concerns among educators. They worry that these fully-automated LLMs might lead students to delegate critical steps of search-as-learning. In this paper, we systematically uncover three main concerns from educators' perspectives. In response to these concerns, we conducted a mixed-methods study with 92 university students to compare three learning sources with different automation levels. Our results show that LLMs support comprehensive understanding of key concepts without promoting passive learning, though their effectiveness in knowledge retention was limited. Additionally, we found that academic performance impacted both learning outcomes and search patterns. Notably, higher-competence learners engaged more deeply with content through reading-intensive behaviors rather than relying on search activities.
https://arxiv.org/abs/2410.01396
There is a need for empathetic and coherent responses in automated chatbot-facilitated psychotherapy sessions. This study addresses the challenge of enhancing the emotional and contextual understanding of large language models (LLMs) in psychiatric applications. We introduce a novel framework that integrates multiple emotion lexicons, including NRC Emotion Lexicon, VADER, WordNet, and SentiWordNet, with state-of-the-art LLMs such as LLAMA 2, Flan-T5, ChatGPT 3.0, and ChatGPT 4.0. The primary dataset comprises over 2,000 therapy session transcripts from the Counseling and Psychotherapy database, covering discussions on anxiety, depression, trauma, and addiction. We segment the transcripts into smaller chunks, enhancing them with lexical features and computing embeddings using BERT, GPT-3, and RoBERTa to capture semantic and emotional nuances. These embeddings are stored in a FAISS vector database, enabling efficient similarity search and clustering based on cosine similarity. Upon user query, the most relevant segments are retrieved and provided as context to the LLMs, significantly improving the models' ability to generate empathetic and contextually appropriate responses. Experimental evaluations demonstrate that incorporating emotion lexicons enhances empathy, coherence, informativeness, and fluency scores. Our findings highlight the critical role of emotional embeddings in improving LLM performance for psychotherapy.
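The retrieval layer described here is a standard embed-index-search loop. Below is a minimal FAISS sketch using cosine similarity via inner product on L2-normalized vectors, with random vectors standing in for the BERT/GPT-3/RoBERTa chunk embeddings.

```python
import numpy as np
import faiss  # pip install faiss-cpu

def build_index(chunk_vectors: np.ndarray) -> faiss.IndexFlatIP:
    """Index normalized chunk embeddings so inner product equals cosine similarity."""
    vecs = chunk_vectors.astype("float32")
    faiss.normalize_L2(vecs)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return index

def retrieve(index, query_vector: np.ndarray, k: int = 5):
    q = query_vector.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))   # (chunk_id, cosine similarity)

# toy usage with random 384-d embeddings standing in for real transcript chunks
chunks = np.random.rand(100, 384)
index = build_index(chunks)
print(retrieve(index, np.random.rand(384)))
```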
https://arxiv.org/abs/2410.01306
The emergence of Vision-Language Models (VLMs) represents a significant advancement in integrating computer vision with Large Language Models (LLMs) to generate detailed text descriptions from visual inputs. Despite their growing importance, the security of VLMs, particularly against backdoor attacks, is underexplored. Moreover, prior works often assume attackers have access to the original training data, which is often unrealistic. In this paper, we address a more practical and challenging scenario where attackers must rely solely on Out-Of-Distribution (OOD) data. We introduce VLOOD (Backdooring Vision-Language Models with Out-of-Distribution Data), a novel approach with two key contributions: (1) demonstrating backdoor attacks on VLMs in complex image-to-text tasks while minimizing degradation of the original semantics under poisoned inputs, and (2) proposing innovative techniques for backdoor injection without requiring any access to the original training data. Our evaluation on image captioning and visual question answering (VQA) tasks confirms the effectiveness of VLOOD, revealing a critical security vulnerability in VLMs and laying the foundation for future research on securing multimodal models against sophisticated threats.
https://arxiv.org/abs/2410.01264