Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a batched speculative decoding system that sets a new state of the art in multi-sequence generation latency and demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-parameter model on a single A100 GPU with a batch size of 8, each sequence is generated at an average speed of 5.8 ms per token, for an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15x speed-up over optimized regular decoding. Within a time budget in which regular decoding does not finish, our system generates sequences with a HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what is feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3x the peak of regular decoding and around 10x that of single-sequence speculative decoding.
https://arxiv.org/abs/2404.15778
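As a rough illustration of the mechanism the abstract above builds on, here is a minimal single-sequence sketch of the draft-then-verify loop in speculative decoding. `draft_sample` and `target_accept` are toy stand-ins for the draft and target models, and the paper's actual contribution, batching many such sequences, is not shown.

```python
def speculative_step(draft_sample, target_accept, context, k=4):
    """One draft-then-verify step: a cheap draft model proposes k tokens,
    then the target model keeps the longest agreeing prefix."""
    proposed = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_sample(ctx)   # draft model proposes the next token
        proposed.append(tok)
        ctx.append(tok)
    accepted = []
    for tok in proposed:
        # in a real system the target model scores all k proposals in one
        # forward pass, which is where the latency win comes from
        if target_accept(context + accepted, tok):
            accepted.append(tok)
        else:
            break                 # first rejection ends the accepted prefix
    return accepted
```

Even in this toy form, the accepted prefix can be several tokens long per target-model call, which is the source of the per-token latency gains the abstract reports.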
Since the inception of the Transformer architecture in 2017, Large Language Models (LLMs) such as GPT and BERT have evolved significantly, impacting various industries with their advanced capabilities in language understanding and generation. These models have shown potential to transform the medical field, highlighting the necessity for specialized evaluation frameworks to ensure their effective and ethical deployment. This comprehensive survey delineates the extensive application and requisite evaluation of LLMs within healthcare, emphasizing the critical need for empirical validation to fully exploit their capabilities in enhancing healthcare outcomes. Our survey is structured to provide an in-depth analysis of LLM applications across clinical settings, medical text data processing, research, education, and public health awareness. We begin by exploring the roles of LLMs in different medical applications, detailing how they are evaluated based on their performance in tasks such as clinical application, medical text data processing, information retrieval, data analysis, medical scientific writing, and educational content generation. The subsequent sections delve into the methodologies employed in these evaluations, discussing the benchmarks and metrics used to assess the models' effectiveness, accuracy, and ethical alignment. Through this survey, we aim to equip healthcare professionals, researchers, and policymakers with a comprehensive understanding of the potential strengths and limitations of LLMs in medical applications. By providing detailed insights into the evaluation processes and the challenges faced in integrating LLMs into healthcare, this survey seeks to guide the responsible development and deployment of these powerful models, ensuring they are harnessed to their full potential while maintaining stringent ethical standards.
https://arxiv.org/abs/2404.15777
Chain-of-thought responses from language models improve performance across most benchmarks. However, it remains unclear to what extent these performance gains can be attributed to human-like task decomposition or simply the greater computation that additional tokens allow. We show that transformers can use meaningless filler tokens (e.g., '......') in place of a chain of thought to solve two hard algorithmic tasks they could not solve when responding without intermediate tokens. However, we find empirically that learning to use filler tokens is difficult and requires specific, dense supervision to converge. We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula. For problems satisfying this characterization, chain-of-thought tokens need not provide information about the intermediate computational steps involved in multi-token computations. In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.
https://arxiv.org/abs/2404.15758
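To make the filler-token setting above concrete, here is a hypothetical sketch of the training-data transformation it implies: chain-of-thought tokens are replaced by meaningless fillers while the question and final answer are kept. The field names and transformation details are illustrative assumptions, not the paper's code.

```python
def to_filler_example(example, filler="."):
    """Replace every chain-of-thought token with a meaningless filler
    token, keeping the input and the final answer unchanged."""
    return {
        "input": example["input"],
        "cot": [filler] * len(example["cot"]),  # '......' instead of reasoning
        "answer": example["answer"],
    }
```

Training on pairs like these is the kind of dense supervision the authors find necessary for models to learn to exploit filler tokens.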
Modular deep learning is the state-of-the-art solution for lifting the curse of multilinguality, preventing the impact of negative interference and enabling cross-lingual performance in Multilingual Pre-trained Language Models. However, a trade-off of this approach is the reduction in positive transfer learning from closely related languages. In response, we introduce a novel method called language arithmetic, which enables training-free post-processing to address this limitation. Inspired by the task arithmetic framework, we apply learning via addition to the language adapters, transitioning the framework from a multi-task to a multilingual setup. The effectiveness of the proposed solution is demonstrated on three downstream tasks in a MAD-X-based set of cross-lingual schemes, acting as a post-processing procedure. Language arithmetic consistently improves the baselines with significant gains in the most challenging cases of zero-shot and low-resource applications. Our code and models are available at this https URL.
https://arxiv.org/abs/2404.15737
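A minimal sketch of the "learning via addition" idea above, assuming language arithmetic amounts to a weighted element-wise sum of two language adapters' parameters; the λ weighting and parameter layout are illustrative, since the abstract does not specify the exact combination rule.

```python
def language_arithmetic(adapter_a, adapter_b, lam=0.5):
    """Training-free post-processing: merge two language adapters by a
    weighted element-wise sum of their parameters (shapes must match).
    Plain Python lists stand in for the adapters' weight tensors."""
    merged = {}
    for name, weights_a in adapter_a.items():
        weights_b = adapter_b[name]
        merged[name] = [lam * a + (1.0 - lam) * b
                        for a, b in zip(weights_a, weights_b)]
    return merged
```

Because the merge is a pure post-processing step on stored weights, it requires no gradient updates, which is what makes the method training-free.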
Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (like RICES), M-ICL is no better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at this https URL.
https://arxiv.org/abs/2404.15736
This report details the development and key achievements of our latest language model designed for custom large language models. The advancements introduced include a novel Online Data Scheduler that supports flexible training data adjustments and curriculum learning. The model's architecture is fortified with state-of-the-art techniques such as Rotary Positional Embeddings, QK-LayerNorm, and a specially crafted multilingual tokenizer to enhance stability and performance. Moreover, our robust training framework incorporates advanced monitoring and rapid recovery features to ensure optimal efficiency. Our Wonton 7B model has demonstrated competitive performance on a range of multilingual and English benchmarks. Future developments will prioritize narrowing the performance gap with more extensively trained models, thereby enhancing the model's real-world efficacy and adaptability. GitHub: this https URL
https://arxiv.org/abs/2404.15702
Chain-of-Thought (CoT) has been a widely adopted prompting method, eliciting impressive reasoning abilities of Large Language Models (LLMs). Inspired by the sequential thought structure of CoT, a number of Chain-of-X (CoX) methods have been developed to address various challenges across diverse domains and tasks involving LLMs. In this paper, we provide a comprehensive survey of Chain-of-X methods for LLMs in different contexts. Specifically, we categorize them by taxonomies of nodes, i.e., the X in CoX, and application tasks. We also discuss the findings and implications of existing CoX methods, as well as potential future directions. Our survey aims to serve as a detailed and up-to-date resource for researchers seeking to apply the idea of CoT to broader scenarios.
https://arxiv.org/abs/2404.15676
Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes an average of 67 weeks. Thus, automating any step of the SR process could reduce the effort associated with SRs. Our objective is to investigate if Large Language Models (LLMs) can accelerate title-abstract screening by simplifying abstracts for human screeners, and by automating title-abstract screening. We performed an experiment where humans screened titles and abstracts for 20 papers with both original and simplified abstracts from a prior SR. The experiment with human screeners was reproduced with the GPT-3.5 and GPT-4 LLMs performing the same screening tasks. We also studied whether different prompting techniques (Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT)) improve the screening performance of LLMs. Lastly, we studied whether redesigning the prompt used in the LLM reproduction of screening leads to improved performance. Text simplification did not increase the screeners' screening performance, but it reduced the time used in screening. Screeners' scientific literacy skills and researcher status predict screening performance. Some LLM and prompt combinations perform as well as human screeners in the screening tasks. Our results indicate that the GPT-4 LLM is better than its predecessor, GPT-3.5. Additionally, Few-shot and One-shot prompting outperform Zero-shot prompting. Using LLMs for text simplification in the screening process does not significantly improve human performance. Using LLMs to automate title-abstract screening seems promising, but current LLMs are not significantly more accurate than human screeners. More research is needed before LLMs can be recommended for the screening process of SRs. We recommend that future SR studies publish replication packages with screening data to enable more conclusive experiments with LLM screening.
https://arxiv.org/abs/2404.15667
Large language models (LLMs) suffer from the hallucination problem and face significant challenges when applied to knowledge-intensive tasks. A promising approach is to leverage evidence documents as extra supporting knowledge, which can be obtained through retrieval or generation. However, existing methods directly leverage the entire contents of the evidence document, which may introduce noise information and impair the performance of large language models. To tackle this problem, we propose a novel Knowledge Selection of Large Language Models (KS-LLM) method, aiming to identify valuable information from evidence documents. The KS-LLM approach utilizes triples to effectively select knowledge snippets from evidence documents that are beneficial to answering questions. Specifically, we first generate triples based on the input question, then select the evidence sentences most similar to triples from the evidence document, and finally combine the evidence sentences and triples to assist large language models in generating answers. Experimental comparisons on several question answering datasets, such as TriviaQA, WebQ, and NQ, demonstrate that the proposed method surpasses the baselines and achieves the best results.
https://arxiv.org/abs/2404.15660
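The KS-LLM pipeline described above can be sketched as follows. The `overlap` similarity is a toy stand-in, since the abstract does not specify the similarity measure used to match evidence sentences to triples.

```python
def overlap(a, b):
    """Toy similarity: Jaccard overlap of lowercase word sets (a stand-in
    for whatever similarity measure the method actually uses)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / max(len(set_a | set_b), 1)

def select_evidence(triples, evidence_sentences, similarity, top_k=2):
    """Score each evidence sentence by its best similarity to any
    question-derived triple and keep the top_k sentences."""
    scored = []
    for sent in evidence_sentences:
        best = max(similarity(t, sent) for t in triples)
        scored.append((best, sent))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:top_k]]
```

The selected sentences, together with the triples, would then be placed in the LLM's prompt in place of the full evidence document.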
Multiple clustering has gained significant attention in recent years due to its potential to reveal multiple hidden structures of data from different perspectives. The advent of deep multiple clustering techniques has notably advanced the performance by uncovering complex patterns and relationships within large datasets. However, a major challenge arises as users often do not need all the clusterings that algorithms generate, and figuring out the one needed requires a substantial understanding of each clustering result. Traditionally, aligning a user's brief keyword of interest with the corresponding vision components was challenging, but the emergence of multi-modal and large language models (LLMs) has begun to bridge this gap. In response, given unlabeled target visual data, we propose Multi-MaP, a novel method employing a multi-modal proxy learning process. It leverages CLIP encoders to extract coherent text and image embeddings, with GPT-4 integrating users' interests to formulate effective textual contexts. Moreover, reference word constraint and concept-level constraint are designed to learn the optimal text proxy according to the user's interest. Multi-MaP not only adeptly captures a user's interest via a keyword but also facilitates identifying relevant clusterings. Our extensive experiments show that Multi-MaP consistently outperforms state-of-the-art methods in all benchmark multi-clustering vision tasks. Our code is available at this https URL.
https://arxiv.org/abs/2404.15655
Recently, directly using large language models (LLMs) has been shown to be the most reliable method to evaluate QA models. However, it suffers from limited interpretability, high cost, and environmental harm. To address these, we propose to use soft EM with entity-driven answer set expansion. Our approach expands the gold answer set to include diverse surface forms, based on the observation that the surface forms often follow particular patterns depending on the entity type. The experimental results show that our method outperforms traditional evaluation methods by a large margin. Moreover, the reliability of our evaluation method is comparable to that of LLM-based ones, while offering the benefits of high interpretability and reduced environmental harm.
https://arxiv.org/abs/2404.15650
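A hedged sketch of soft EM with entity-driven answer set expansion as described above; the per-entity-type variant rules here are invented for illustration and are not the paper's actual surface-form patterns.

```python
def expand_answers(gold_answers, entity_type):
    """Entity-driven expansion: add common surface-form variants of each
    gold answer according to its entity type."""
    expanded = set(gold_answers)
    for ans in gold_answers:
        if entity_type == "person":
            expanded.add(ans.split()[-1])        # surname-only variant
        elif entity_type == "date":
            expanded.add(ans.replace(",", ""))   # comma-free variant
    return expanded

def soft_em(prediction, answer_set):
    """Soft exact match: credit the prediction if any expanded gold
    surface form appears inside it (case-insensitive)."""
    pred = prediction.lower()
    return any(ans.lower() in pred for ans in answer_set)
```

Unlike an LLM judge, every accepted match can be traced back to a specific surface form, which is where the interpretability benefit comes from.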
As Large Language Models (LLMs) are increasingly used to automate code generation, it is often desired to know if code is AI-generated and by which model, especially for purposes like protecting intellectual property (IP) in industry and preventing academic misconduct in education. Incorporating watermarks into machine-generated content is one way to provide code provenance, but existing solutions are restricted to a single bit or lack flexibility. We present CodeIP, a new watermarking technique for LLM-based code generation. CodeIP enables the insertion of multi-bit information while preserving the semantics of the generated code, improving the strength and diversity of the inserted watermark. This is achieved by training a type predictor to predict the grammar type of the next token, enhancing the syntactical and semantic correctness of the generated code. Experiments on a real-world dataset across five programming languages showcase the effectiveness of CodeIP.
https://arxiv.org/abs/2404.15639
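Not CodeIP itself, but a toy sketch of the two ingredients the abstract names: embedding multi-bit information by biasing token choice, restricted to tokens a type predictor has marked as grammatically valid (`allowed_tokens` here). The hash-parity scheme is an assumption borrowed from generic LLM watermarking, not the paper's method.

```python
import hashlib

def watermark_choice(allowed_tokens, message_bits, step):
    """Embed one message bit per step by preferring tokens whose hash
    parity matches the bit, restricted to grammatically valid tokens."""
    bit = message_bits[step % len(message_bits)]  # cycle through the message
    def parity(tok):
        return int(hashlib.sha256(tok.encode()).hexdigest(), 16) & 1
    matching = [t for t in allowed_tokens if parity(t) == bit]
    pool = matching or allowed_tokens  # fall back if no token carries the bit
    return pool[0]  # stand-in for sampling from the model's biased logits
```

A detector with the same hash function can recover the bit stream from the generated tokens, while the grammar restriction keeps the watermarked code syntactically valid.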
In the field of business data analysis, the ability to extract actionable insights from vast and varied datasets is essential for informed decision-making and maintaining a competitive edge. Traditional rule-based systems, while reliable, often fall short when faced with the complexity and dynamism of modern business data. Conversely, Artificial Intelligence (AI) models, particularly Large Language Models (LLMs), offer significant potential in pattern recognition and predictive analytics but can lack the precision necessary for specific business applications. This paper explores the efficacy of hybrid approaches that integrate the robustness of rule-based systems with the adaptive power of LLMs in generating actionable business insights.
https://arxiv.org/abs/2404.15604
Existing datasets for attribute value extraction (AVE) predominantly focus on explicit attribute values while neglecting the implicit ones, lack product images, are often not publicly available, and lack an in-depth human inspection across diverse domains. To address these limitations, we present ImplicitAVE, the first, publicly available multimodal dataset for implicit attribute value extraction. ImplicitAVE, sourced from the MAVE dataset, is carefully curated and expanded to include implicit AVE and multimodality, resulting in a refined dataset of 68k training and 1.6k testing data across five domains. We also explore the application of multimodal large language models (MLLMs) to implicit AVE, establishing a comprehensive benchmark for MLLMs on the ImplicitAVE dataset. Six recent MLLMs with eleven variants are evaluated across diverse settings, revealing that implicit value extraction remains a challenging task for MLLMs. The contributions of this work include the development and release of ImplicitAVE, and the exploration and benchmarking of various MLLMs for implicit AVE, providing valuable insights and potential future research directions. Dataset and code are available at this https URL.
https://arxiv.org/abs/2404.15592
General purpose Large Language Models (LLM) such as the Generative Pretrained Transformer (GPT) and Large Language Model Meta AI (LLaMA) have attracted much attention in recent years. There is strong evidence that these models can perform remarkably well in various natural language processing tasks. However, how to leverage them to approach domain-specific use cases and drive value remains an open question. In this work, we focus on a specific use case, pharmaceutical manufacturing investigations, and propose that leveraging historical records of manufacturing incidents and deviations in an organization can be beneficial for addressing and closing new cases, or de-risking new manufacturing campaigns. Using a small but diverse dataset of real manufacturing deviations selected from different product lines, we evaluate and quantify the power of three general purpose LLMs (GPT-3.5, GPT-4, and Claude-2) in performing tasks related to the above goal. In particular, (1) the ability of LLMs in automating the process of extracting specific information such as root cause of a case from unstructured data, as well as (2) the possibility of identifying similar or related deviations by performing semantic search on the database of historical records are examined. While our results point to the high accuracy of GPT-4 and Claude-2 in the information extraction task, we discuss cases of complex interplay between the apparent reasoning and hallucination behavior of LLMs as a risk factor. Furthermore, we show that semantic search on vector embedding of deviation descriptions can be used to identify similar records, such as those with a similar type of defect, with a high level of accuracy. We discuss further improvements to enhance the accuracy of similar record identification.
https://arxiv.org/abs/2404.15578
Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention head is largely responsible for retrieving information, which we dub retrieval heads. We identify intriguing properties of retrieval heads: (1) universal: all the explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval heads; (3) intrinsic: retrieval heads already exist in models pretrained with short context, and when the context length is extended by continual pretraining, it is still the same set of heads that performs information retrieval; (4) dynamically activated: taking Llama-2 7B as an example, 12 retrieval heads always attend to the required information no matter how the context is changed, while the rest are activated in different contexts; (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model's retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back to the question and previously generated context. Conversely, tasks where the model directly generates the answer using its intrinsic knowledge are less impacted by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.
https://arxiv.org/abs/2404.15574
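A simplified sketch of how one might score retrieval heads from attention traces, under the assumption that a head's trace records where its strongest attention lands at each decoding step; the paper's actual criterion additionally checks that the attended token is copied into the output.

```python
def retrieval_score(top_attended, needle_span):
    """Fraction of decoding steps at which this head's strongest
    attention lands inside the needle span [lo, hi)."""
    lo, hi = needle_span
    hits = sum(1 for pos in top_attended if lo <= pos < hi)
    return hits / max(len(top_attended), 1)

def find_retrieval_heads(head_traces, needle_span, threshold=0.5):
    """Flag heads whose score exceeds the threshold; per the abstract,
    expect a sparse set (under 5% of all heads)."""
    return [head for head, trace in head_traces.items()
            if retrieval_score(trace, needle_span) > threshold]
```

Running such a scorer over needle-in-a-haystack prompts at varying needle positions is one way to probe the "universal" and "dynamically activated" properties the abstract lists.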
Clinical trial matching is the task of identifying trials for which patients may be potentially eligible. Typically, this task is labor-intensive and requires detailed verification of patient electronic health records (EHRs) against the stringent inclusion and exclusion criteria of clinical trials. This process is manual, time-intensive, and challenging to scale up, resulting in many patients missing out on potential therapeutic options. Recent advancements in Large Language Models (LLMs) have made automating patient-trial matching possible, as shown in multiple concurrent research studies. However, the current approaches are confined to constrained, often synthetic datasets that do not adequately mirror the complexities encountered in real-world medical data. In this study, we present the first, end-to-end large-scale empirical evaluation of clinical trial matching using real-world EHRs. Our study showcases the capability of LLMs to accurately match patients with appropriate clinical trials. We perform experiments with proprietary LLMs, including GPT-4 and GPT-3.5, as well as our custom fine-tuned model called OncoLLM and show that OncoLLM, despite its significantly smaller size, not only outperforms GPT-3.5 but also matches the performance of qualified medical doctors. All experiments were carried out on real-world EHRs that include clinical notes and available clinical trials from a single cancer center in the United States.
https://arxiv.org/abs/2404.15549
This paper presents BattleAgent, an emulation system that combines the Large Vision-Language Model and Multi-agent System. This novel system aims to simulate complex dynamic interactions among multiple agents, as well as between agents and their environments, over a period of time. It emulates both the decision-making processes of leaders and the viewpoints of ordinary participants, such as soldiers. The emulation showcases the current capabilities of agents, featuring fine-grained multi-modal interactions between agents and landscapes. It develops customizable agent structures to meet specific situational requirements, for example, a variety of battle-related activities like scouting and trench digging. These components collaborate to recreate historical events in a lively and comprehensive manner while offering insights into the thoughts and feelings of individuals from diverse viewpoints. The technological foundations of BattleAgent establish detailed and immersive settings for historical battles, enabling individual agents to partake in, observe, and dynamically respond to evolving battle scenarios. This methodology holds the potential to substantially deepen our understanding of historical events, particularly through individual accounts. Such initiatives can also aid historical research, as conventional historical narratives often lack documentation and prioritize the perspectives of decision-makers, thereby overlooking the experiences of ordinary individuals. BattleAgent illustrates AI's potential to revitalize the human aspect in crucial social events, thereby fostering a more nuanced collective understanding and driving the progressive development of human society.
https://arxiv.org/abs/2404.15532
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logical reasoning' has remained underexplored. Existing work investigating this reasoning ability of LLMs has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. Addressing the above limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning over propositional, first-order, and non-monotonic logics. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset focusing on the use of a single inference rule. We conduct detailed analysis with a range of LLMs such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes overlook contextual information necessary for reasoning to arrive at the correct conclusion. We believe that our work and findings facilitate future research for evaluating and enhancing the logical reasoning ability of LLMs. Data and code are available at this https URL.
https://arxiv.org/abs/2404.15522
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of (reference image, text, target image). These specific triplets are not as commonly available as simple image-text pairs, limiting the widespread use of CIR and its scalability. On the other hand, zero-shot CIR can be relatively easily trained with image-caption pairs without considering the image-to-image relation, but this approach tends to yield lower accuracy. We propose a new semi-supervised CIR approach in which we search for a reference image and its related target images in auxiliary data and train our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., visual delta) between the two. VDG, equipped with fluent language knowledge and being model agnostic, can generate pseudo triplets to boost the performance of CIR models. Our approach significantly improves on existing supervised learning approaches and achieves state-of-the-art results on the CIR benchmarks.
https://arxiv.org/abs/2404.15516